Abstract

We derive an upper bound on the local Rademacher complexity of $\ell_p$-norm multiple kernel
learning, which yields a tighter excess risk bound than global approaches. Previous local
approaches analyzed the case $p = 1$ only, while our analysis covers all cases
$1 \le p \le \infty$, assuming the different feature mappings corresponding to the different kernels
to be uncorrelated. We also show a lower bound that shows that the bound is tight,
and derive consequences regarding excess loss, namely fast convergence rates of the order
$O(n^{-\frac{\alpha}{1+\alpha}})$, where $\alpha$ is the minimum eigenvalue decay rate of the individual kernels.
Keywords: multiple kernel learning, learning kernels, generalization bounds, local 
Rademacher complexity 



1. Introduction 

Propelled by the increasing "industrialization" of modern application domains such as
bioinformatics or computer vision leading to the accumulation of vast amounts of data, the
past decade experienced a rapid professionalization of machine learning methods.
Sophisticated machine learning solutions such as the support vector machine can nowadays almost
completely be applied out-of-the-box (Bouckaert et al., 2010). Nevertheless, a displeasing
stumbling block towards the complete automatization of machine learning remains that of
finding the best abstraction or kernel for a problem at hand.

* A part of the work was done while MK was at Learning Theory Group, Computer Science Division and
Department of Statistics, University of California, Berkeley, CA 94720-1758, USA.

In the current state of research, there is little hope that a machine will be able to find 
automatically — or even engineer — the best kernel for a particular problem (Searle, 1980). 
However, by restricting to a less general problem, namely to a finite set of base kernels 
the algorithm can pick from, one might hope to achieve automatic kernel selection: clearly, 
cross-validation based model selection (Stone, 1974) can be applied if the number of base 
kernels is decent. Still, the performance of such an algorithm is limited by the performance 
of the best kernel in the set. 

In the seminal work of Lanckriet et al. (2004) it was shown that it is computationally
feasible to simultaneously learn a support vector machine and a linear combination of kernels,
if we require the so-formed kernel combinations to be positive definite and
trace-norm normalized. Though feasible for small sample sizes, the computational burden
of this so-called multiple kernel learning (MKL) approach is still high. By further restricting
the multi-kernel class to only contain convex combinations of kernels, the efficiency can be
considerably improved, so that tens of thousands of training points and thousands of kernels
can be processed (Sonnenburg et al., 2006).

However, these computational advances come at a price. Empirical evidence has
accumulated showing that sparse-MKL optimized kernel combinations rarely help in practice
and are frequently outperformed by a regular SVM using an unweighted-sum kernel
$K = \sum_m K_m$ (Cortes et al., 2008; Gehler and Nowozin, 2009), leading for instance to the
provocative question "Can learning kernels help performance?" (Cortes, 2009).

By imposing an $\ell_q$-norm, $q \ge 1$, rather than an $\ell_1$ penalty on the kernel combination
coefficients, MKL was finally made useful for practical applications and profitable (Kloft et
al., 2009, 2011). The $\ell_q$-norm MKL is an empirical minimization algorithm that operates
on the multi-kernel class consisting of functions $f : x \mapsto \langle w, \phi_k(x)\rangle$ with $\|w\|_k \le D$,
where $\phi_k$ is the kernel mapping into the reproducing kernel Hilbert space (RKHS) $\mathcal{H}_k$ with
kernel $k$ and norm $\|\cdot\|_k$, while the kernel $k$ itself ranges over the set of possible kernels

$$\Big\{ k = \sum_{m=1}^M \theta_m k_m \;\Big|\; \|\theta\|_q \le 1,\ \theta \ge 0 \Big\}.$$

In Figure 1, we reproduce exemplary results taken from Kloft et al. (2009, 2011) (see
also references therein for further evidence pointing in the same direction). We first observe
that, as expected, $\ell_q$-norm MKL enforces strong sparsity in the coefficients $\theta_m$ when $q = 1$,
and no sparsity at all for $q = \infty$, which corresponds to the SVM with an unweighted-sum
kernel, while intermediate values of $q$ enforce different degrees of soft sparsity (understood
as the steepness of the decrease of the ordered coefficients $\theta_m$). Crucially, the performance
(as measured by the AUC criterion) is not monotonic as a function of $q$; $q = 1$ (sparse
MKL) yields significantly worse performance than $q = \infty$ (regular SVM with sum kernel),
but optimal performance is attained for some intermediate value of $q$. This is a strong
empirical motivation to study theoretically the performance of $\ell_q$-MKL beyond the limiting
cases $q = 1$ or $q = \infty$.



[Figure 1 not reproduced. Left panel: AUC versus sample size (10K to 60K) for 1-norm, 4/3-norm, 2-norm, and 4-norm MKL and the SVM. Right panel: kernel weights for 1-norm, 4/3-norm, 2-norm, 4-norm MKL and the unweighted-sum kernel.]

Figure 1: Splice site detection experiment in Kloft et al. (2009, 2011). Left: The area
under the ROC curve as a function of the training set size is shown. The regular SVM is
equivalent to $q = \infty$ (or $p = 2$). Right: The optimal kernel weights $\theta_m$ as output by
$\ell_q$-norm MKL are shown.



A conceptual milestone going back to the work of Bach et al. (2004) and Micchelli
and Pontil (2005) is that the above multi-kernel class can equivalently be represented as
a block-norm regularized linear class in the product Hilbert space $\mathcal{H} := \mathcal{H}_1 \times \cdots \times \mathcal{H}_M$,
where $\mathcal{H}_m$ denotes the RKHS associated to kernel $k_m$, $1 \le m \le M$. More precisely,
denoting by $\phi_m$ the kernel feature mapping associated to kernel $k_m$ over input space $\mathcal{X}$, and
$\phi : x \in \mathcal{X} \mapsto (\phi_1(x), \dots, \phi_M(x)) \in \mathcal{H}$, the class of functions defined above coincides with

$$H_{p,D,M} = \big\{ f_w : x \mapsto \langle w, \phi(x)\rangle \;\big|\; w = (w^{(1)}, \dots, w^{(M)}),\ \|w\|_{2,p} \le D \big\}, \tag{1}$$

where there is a one-to-one mapping of $q \in [1, \infty]$ to $p \in [1,2]$ given by $p = \frac{2q}{q+1}$. The $\ell_{2,p}$-norm
is defined here as $\|w\|_{2,p} := \big\| \big( \|w^{(1)}\|_{k_1}, \dots, \|w^{(M)}\|_{k_M} \big) \big\|_p = \big( \sum_{m=1}^M \|w^{(m)}\|_{k_m}^p \big)^{1/p}$;
for simplicity, we will frequently write $\|w^{(m)}\|_2 = \|w^{(m)}\|_{k_m}$.
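The block norm above can be illustrated numerically. The following is a minimal sketch, assuming finite-dimensional blocks whose Euclidean norms stand in for the RKHS norms $\|w^{(m)}\|_{k_m}$; the helper name `block_norm` and the example vector are hypothetical and not part of the original text.

```python
import numpy as np

def block_norm(blocks, p):
    """l_{2,p} block norm: the l_p norm of the vector of per-block Euclidean norms."""
    inner = np.array([np.linalg.norm(b) for b in blocks])
    if np.isinf(p):
        return float(inner.max())
    return float((inner ** p).sum() ** (1.0 / p))

# Hypothetical weight vector with M = 3 blocks of different dimensions.
w = [np.array([3.0, 4.0]), np.array([1.0]), np.array([0.0, 2.0, 0.0])]

norm_1 = block_norm(w, 1.0)       # 5 + 1 + 2
norm_2 = block_norm(w, 2.0)       # sqrt(25 + 1 + 4)
norm_inf = block_norm(w, np.inf)  # largest block norm
```

The norm is non-increasing in $p$, so a ball of radius $D$ in $\|\cdot\|_{2,p}$ grows with $p$; this monotonicity is what makes the complexity of the class depend on the choice of $p$.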

Clearly, the complexity of the class (1) will be greater than that of a class based on a single
kernel only. However, it is unclear whether the increase is decent or considerably high
and, since there is a free parameter $p$, how this relates to the choice of $p$. To this end the
main aim of this paper is to analyze the sample complexity of the above hypothesis class
(1). An analysis of this model, based on global Rademacher complexities, was developed
by Cortes et al. (2010). In the present work, we base our main analysis on the theory of
local Rademacher complexities, which allows us to derive improved and more precise rates of
convergence.

Outline of the contributions. This paper makes the following contributions: 

• Upper bounds on the local Rademacher complexity of $\ell_p$-norm MKL are shown, from
which we derive an excess risk bound that achieves a fast convergence rate of the order
$O(n^{-\frac{\alpha}{1+\alpha}})$, where $\alpha$ is the minimum eigenvalue decay rate of the individual kernels
(previous bounds for $\ell_p$-norm MKL only achieved $O(n^{-\frac{1}{2}})$).

• A lower bound is shown that, up to absolute constants, matches the upper bounds,
showing that our results are tight.

• The generalization performance of $\ell_p$-norm MKL as guaranteed by the excess risk
bound is studied for varying values of $p$, shedding light on the appropriateness of a
small/large $p$ in various learning scenarios.

Furthermore, we also present a simpler proof of the global Rademacher bound shown in
Cortes et al. (2010). A comparison of the rates obtained with local and global Rademacher
analysis, respectively, can be found in Section 6.1.

Notation. For notational simplicity we will omit feature maps and directly view $\phi(x)$
and $\phi_m(x)$ as random variables $x$ and $x^{(m)}$ taking values in the Hilbert space $\mathcal{H}$ and $\mathcal{H}_m$,
respectively, where $x = (x^{(1)}, \dots, x^{(M)})$. Correspondingly, the hypothesis class we are
interested in reads $H_{p,D,M} = \{ f_w : x \mapsto \langle w, x\rangle \mid \|w\|_{2,p} \le D \}$. If $D$ or $M$ are clear from
the context, we sometimes synonymously denote $H_p = H_{p,D} = H_{p,D,M}$. We will frequently
use the notation $(u^{(m)})_{m=1}^M$ for the element $u = (u^{(1)}, \dots, u^{(M)}) \in \mathcal{H} = \mathcal{H}_1 \times \dots \times \mathcal{H}_M$.

We denote the kernel matrices corresponding to $k$ and $k_m$ by $K$ and $K_m$, respectively.
Note that we are considering normalized kernel Gram matrices, i.e., the $ij$th entry of $K$
is $\frac{1}{n} k(x_i, x_j)$. We will also work with covariance operators in Hilbert spaces. In a finite
dimensional vector space, the (uncentered) covariance operator can be defined in usual
vector/matrix notation as $\mathbb{E}\, x x^\top$. Since we are working with potentially infinite-dimensional
vector spaces, we will use instead of $x x^\top$ the tensor notation $x \otimes x \in \mathrm{HS}(\mathcal{H})$, which is a
Hilbert-Schmidt operator $\mathcal{H} \to \mathcal{H}$ defined as $(x \otimes x) u = \langle x, u\rangle x$. The space $\mathrm{HS}(\mathcal{H})$ of
Hilbert-Schmidt operators on $\mathcal{H}$ is itself a Hilbert space, and the expectation $\mathbb{E}\, x \otimes x$ is
well-defined and belongs to $\mathrm{HS}(\mathcal{H})$ as soon as $\mathbb{E}\|x\|^2$ is finite, which will always be assumed (as
a matter of fact, we will often assume that $\|x\|$ is bounded a.s.). We denote by $J = \mathbb{E}\, x \otimes x$,
$J_m = \mathbb{E}\, x^{(m)} \otimes x^{(m)}$ the uncentered covariance operators corresponding to variables $x$, $x^{(m)}$;
it holds that $\mathrm{tr}(J) = \mathbb{E}\|x\|_2^2$ and $\mathrm{tr}(J_m) = \mathbb{E}\|x^{(m)}\|_2^2$.

Finally, for $p \in [1, \infty]$ we use the standard notation $p^*$ to denote the conjugate of $p$,
that is, $p^* \in [1, \infty]$ and $\frac{1}{p} + \frac{1}{p^*} = 1$.



2. Global Rademacher Complexities in Multiple Kernel Learning 

We first review global Rademacher complexities (GRC) in multiple kernel learning. Let
$x_1, \dots, x_n$ be an i.i.d. sample drawn from $P$. The global Rademacher complexity is defined
as $R(H_p) = \mathbb{E} \sup_{f_w \in H_p} \langle w, \frac{1}{n}\sum_{i=1}^n \sigma_i x_i \rangle$, where $(\sigma_i)_{1 \le i \le n}$ is an i.i.d. family (independent
of $(x_i)$) of Rademacher variables (random signs). Its empirical counterpart is denoted by
$\hat{R}(H_p) = \mathbb{E}\big[ \sup_{f_w \in H_p} \langle w, \frac{1}{n}\sum_{i=1}^n \sigma_i x_i \rangle \,\big|\, x_1, \dots, x_n \big] = \mathbb{E}_\sigma \sup_{f_w \in H_p} \langle w, \frac{1}{n}\sum_{i=1}^n \sigma_i x_i \rangle$. The interest in the global
Rademacher complexity stems from the fact that, if known, it can be used to bound the generalization
error (Koltchinskii, 2001; Bartlett and Mendelson, 2002).

In the recent paper of Cortes et al. (2010) it was shown using a combinatorial argument
that the empirical version of the global Rademacher complexity can be bounded as

$$\hat{R}(H_p) \le D \sqrt{ \frac{c\, p^*}{n} \Big\| \big( \mathrm{tr}(K_m) \big)_{m=1}^M \Big\|_{\frac{p^*}{2}} },$$

where $c = \frac{23}{22}$ and $\mathrm{tr}(K)$ denotes the trace of the kernel matrix $K$. We will now show a
quite short proof of this result and then present a bound on the population version of the
GRC. The proof presented here is based on the Khintchine-Kahane inequality (Kahane,
1985) using the constants taken from Lemma 3.3.1 and Proposition 3.4.1 in Kwapień and
Woyczyński (1992).

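Lemma 1 (Khintchine-Kahane inequality). Let $v_1, \dots, v_n \in \mathcal{H}$. Then, for any $q \ge 1$,
it holds

$$\mathbb{E}_\sigma \Big\| \sum_{i=1}^n \sigma_i v_i \Big\|_2^q \le \Big( c \sum_{i=1}^n \|v_i\|_2^2 \Big)^{\frac{q}{2}},$$

where $c = \max(1, q - 1)$. We will apply the lemma with $q = p^*$, so that $c = \max(1, p^* - 1)$; in particular the result holds for $c = p^*$.

Proposition 2 (Global Rademacher complexity, empirical version). For any $p \ge 1$ the
empirical version of global Rademacher complexity of the multi-kernel class $H_p$ can be bounded
as

$$\forall t \ge p: \quad \hat{R}(H_p) \le D \sqrt{ \frac{t^*}{n} \Big\| \big( \mathrm{tr}(K_m) \big)_{m=1}^M \Big\|_{\frac{t^*}{2}} }.$$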

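Proof. First note that it suffices to prove the result for $t = p$, as trivially $\|x\|_{2,t} \le \|x\|_{2,p}$
holds for all $t \ge p$ and therefore $\hat{R}(H_p) \le \hat{R}(H_t)$. We can use a block-structured version
of Hölder's inequality (cf. Lemma 12) and the Khintchine-Kahane (K.-K.) inequality (cf.
Lemma 1) to bound the empirical version of the global Rademacher complexity as follows:

$$\begin{aligned}
\hat{R}(H_p) &= \mathbb{E}_\sigma \sup_{f_w \in H_p} \Big\langle w, \frac{1}{n}\sum_{i=1}^n \sigma_i x_i \Big\rangle \\
&\overset{\text{Hölder}}{\le} D\, \mathbb{E}_\sigma \Big\| \frac{1}{n}\sum_{i=1}^n \sigma_i x_i \Big\|_{2,p^*} \\
&\overset{\text{Jensen}}{\le} D \Big( \mathbb{E}_\sigma \sum_{m=1}^M \Big\| \frac{1}{n}\sum_{i=1}^n \sigma_i x_i^{(m)} \Big\|_2^{p^*} \Big)^{\frac{1}{p^*}} \\
&\overset{\text{K.-K.}}{\le} D \sqrt{\frac{p^*}{n}} \Big( \sum_{m=1}^M \Big( \frac{1}{n}\sum_{i=1}^n \big\| x_i^{(m)} \big\|_2^2 \Big)^{\frac{p^*}{2}} \Big)^{\frac{1}{p^*}} \\
&= D \sqrt{ \frac{p^*}{n} \Big\| \big( \mathrm{tr}(K_m) \big)_{m=1}^M \Big\|_{\frac{p^*}{2}} },
\end{aligned}$$

where the last step uses $\mathrm{tr}(K_m) = \frac{1}{n}\sum_{i=1}^n \|x_i^{(m)}\|_2^2$ for the normalized Gram matrices. This is what was to be shown. ∎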
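The bound of Proposition 2 can be sanity-checked on synthetic data. The sketch below is an illustration under stated assumptions, not the authors' experiment: it assumes linear kernels with explicit $d$-dimensional feature blocks, reads the bound (with $t = p$) as $D\sqrt{\frac{p^*}{n}\|(\mathrm{tr}(K_m))_m\|_{p^*/2}}$, and uses the fact that, by Hölder duality, the supremum over $\|w\|_{2,p} \le D$ equals $D\,\|\frac{1}{n}\sum_i \sigma_i x_i\|_{2,p^*}$. All sizes and the random data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, M, d, D, p = 200, 5, 3, 1.0, 1.5
p_star = p / (p - 1.0)                 # conjugate exponent, 1/p + 1/p* = 1

# Hypothetical data: M blocks of d-dimensional features (linear kernels).
X = rng.normal(size=(n, M, d))

# Normalized Gram matrices have trace tr(K_m) = (1/n) * sum_i ||x_i^(m)||^2.
traces = (X ** 2).sum(axis=(0, 2)) / n

# Bound with t = p: D * sqrt(p*/n * ||(tr K_m)_m||_{p*/2}).
norm_traces = (traces ** (p_star / 2)).sum() ** (2.0 / p_star)
bound = D * np.sqrt(p_star / n * norm_traces)

# Monte Carlo estimate of the empirical Rademacher complexity:
# average of D * ||(1/n) sum_i sigma_i x_i||_{2,p*} over random sign vectors.
sigma = rng.choice([-1.0, 1.0], size=(2000, n))
means = np.einsum("sn,nmd->smd", sigma, X) / n   # shape (2000, M, d)
block_norms = np.linalg.norm(means, axis=2)      # shape (2000, M)
estimate = D * ((block_norms ** p_star).sum(axis=1) ** (1.0 / p_star)).mean()
```

Comparing `estimate` with `bound` gives a rough feel for how much slack the Khintchine-Kahane and Jensen steps introduce on a given data set.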



Remark. Note that there is a very good reason to state the above bound in terms of
$t \ge p$ instead of solely in terms of $p$: the Rademacher complexity $\hat{R}(H_p)$ is not monotonic
in $p$ and thus it is not always the best choice to take $t := p$ in the above bound. This
is readily seen, for example, for the easy case where all kernels have the same trace. In
that case the bound translates into $\hat{R}(H_p) \le D \sqrt{ t^* M^{\frac{2}{t^*}} \frac{\mathrm{tr}(K_1)}{n} }$. Interestingly, the function
$x \mapsto x M^{2/x}$ is not monotone and attains its minimum for $x = 2 \log M$, where $\log$ denotes
the natural logarithm. This has interesting consequences: for
any $p \le (2 \log M)^*$ we can take $t^* = 2\log M$, for which $t^* M^{2/t^*} = 2e\log M$, and obtain the bound $\hat{R}(H_p) \le D \sqrt{ \frac{2e \log(M)\, \mathrm{tr}(K_1)}{n} }$, which has only a
mild dependency on the number of kernels; note that in particular we can take this bound
for the $\ell_1$-norm class $\hat{R}(H_1)$ for all $M > 1$.
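The location of the minimum in the remark can be checked numerically. The following is a small sketch, assuming the function in question is $g(x) = x M^{2/x}$ (our reading of the remark); the value $M = 50$ and the search grid are arbitrary choices for illustration.

```python
import math

def g(x, M):
    """The function x -> x * M**(2/x) from the remark."""
    return x * M ** (2.0 / x)

M = 50
x_star = 2.0 * math.log(M)                        # claimed minimizer
grid = [1.0 + 0.01 * k for k in range(1, 3000)]   # x in (1, 31)
x_best = min(grid, key=lambda x: g(x, M))         # grid minimizer
g_min = g(x_star, M)                              # equals 2 * e * log(M)
```

Since $M^{2/x} = e$ at $x = 2\log M$, the minimal value is $2e\log M$, which grows only logarithmically in the number of kernels.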

Despite the simplicity of the above proof, the constants are slightly better than the ones
achieved in Cortes et al. (2010). However, computing the population version of the global
Rademacher complexity of MKL is somewhat more involved and to the best of our knowledge
has not been addressed yet by the literature. To this end, note that from the previous proof
we obtain $R(H_p) \le \mathbb{E}\, D \sqrt{p^*/n}\, \big( \sum_{m=1}^M \big( \frac{1}{n}\sum_{i=1}^n \|x_i^{(m)}\|_2^2 \big)^{\frac{p^*}{2}} \big)^{\frac{1}{p^*}}$. We thus can use Jensen's
inequality to move the expectation operator inside the root,

$$R(H_p) \le D \sqrt{p^*/n}\, \Big( \sum_{m=1}^M \mathbb{E} \Big( \frac{1}{n}\sum_{i=1}^n \big\|x_i^{(m)}\big\|_2^2 \Big)^{\frac{p^*}{2}} \Big)^{\frac{1}{p^*}}, \tag{2}$$

but now need a handle on the $\frac{p^*}{2}$-th moments. To this aim we use the inequalities of
Rosenthal (1970) and Young (e.g., Steele, 2004) to show the following lemma.

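Lemma 3 (Rosenthal + Young). Let $X_1, \dots, X_n$ be independent nonnegative random
variables satisfying $\forall i: X_i \le B < \infty$ almost surely. Then, denoting $C_q = (2qe)^q$, for any $q \ge \frac{1}{2}$
it holds

$$\mathbb{E} \Big( \frac{1}{n}\sum_{i=1}^n X_i \Big)^q \le C_q \Big( \Big(\frac{B}{n}\Big)^q + \Big( \frac{1}{n}\sum_{i=1}^n \mathbb{E} X_i \Big)^q \Big).$$

The proof is deferred to Appendix A. It is now easy to show: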

Corollary 4 (Global Rademacher complexity, population version). Assume the kernels
are uniformly bounded, that is, $\|k\|_\infty \le B < \infty$, almost surely. Then for any $p \ge 1$ the
population version of global Rademacher complexity of the multi-kernel class $H_p$ can be
bounded as

$$\forall t \ge p: \quad R(H_{p,D,M}) \le D\, t^* \sqrt{ \frac{e}{n} \Big\| \big( \mathrm{tr}(J_m) \big)_{m=1}^M \Big\|_{\frac{t^*}{2}} } + \frac{\sqrt{Be}\, D M^{\frac{1}{t^*}} t^*}{n}.$$

For $t \ge 2$ the right-hand term can be discarded and the result also holds for unbounded
kernels.

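Proof. As in the previous proof it suffices to prove the result for $t = p$. From (2) we
conclude by the previous Lemma

$$\begin{aligned}
R(H_p) &\le D \sqrt{\frac{p^*}{n}} \Big( \sum_{m=1}^M (e p^*)^{\frac{p^*}{2}} \Big( \Big(\frac{B}{n}\Big)^{\frac{p^*}{2}} + \Big( \underbrace{\mathbb{E}\, \frac{1}{n}\sum_{i=1}^n \big\|x_i^{(m)}\big\|_2^2}_{=\mathrm{tr}(J_m)} \Big)^{\frac{p^*}{2}} \Big) \Big)^{\frac{1}{p^*}} \\
&\le D p^* \sqrt{ \frac{e}{n} \Big\| \big( \mathrm{tr}(J_m) \big)_{m=1}^M \Big\|_{\frac{p^*}{2}} } + \frac{\sqrt{Be}\, D M^{\frac{1}{p^*}} p^*}{n},
\end{aligned}$$

where for the last inequality we use the subadditivity of the root function. Note
that for $p \ge 2$ it is $p^*/2 \le 1$ and thus it suffices to employ Jensen's inequality instead
of the previous lemma so that we come along without the last term on the right-hand side. ∎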

For example, when the traces of the kernels are bounded, the above bound is essentially
determined by $O\big( \frac{p^* M^{1/p^*}}{\sqrt{n}} \big)$. We can also remark that by setting $t = (\log(M))^*$ we obtain
the bound $R(H_1) = O\big( \frac{\log M}{\sqrt{n}} \big)$.

3. The Local Rademacher Complexity of Multiple Kernel Learning 

Let $x_1, \dots, x_n$ be an i.i.d. sample drawn from $P$. We define the local Rademacher
complexity of $H_p$ as $R_r(H_p) = \mathbb{E} \sup_{f_w \in H_p :\, P f_w^2 \le r} \langle w, \frac{1}{n}\sum_{i=1}^n \sigma_i x_i \rangle$, where $P f_w^2 := \mathbb{E}(f_w(x))^2$. Note
that it subsumes the global RC as a special case for $r = \infty$. As self-adjoint, positive
Hilbert-Schmidt operators, covariance operators enjoy discrete eigenvalue-eigenvector
decompositions $J = \mathbb{E}\, x \otimes x = \sum_{j=1}^\infty \lambda_j u_j \otimes u_j$ and $J_m = \mathbb{E}\, x^{(m)} \otimes x^{(m)} = \sum_{j=1}^\infty \lambda_j^{(m)} u_j^{(m)} \otimes u_j^{(m)}$,
where $(u_j)_{j \ge 1}$ and $(u_j^{(m)})_{j \ge 1}$ form orthonormal bases of $\mathcal{H}$ and $\mathcal{H}_m$, respectively.
We will need the following assumption for the case $1 \le p \le 2$:

Assumption (U) (no-correlation). The Hilbert space valued variables $x^{(1)}, \dots, x^{(M)}$ are
said to be (pairwise) uncorrelated if for any $m \ne m'$ and $w \in \mathcal{H}_m$, $w' \in \mathcal{H}_{m'}$, the real
variables $\langle w, x^{(m)} \rangle$ and $\langle w', x^{(m')} \rangle$ are uncorrelated.

Since $\mathcal{H}_m, \mathcal{H}_{m'}$ are RKHSs with kernels $k_m, k_{m'}$, if we go back to the input random
variable in the original space $X \in \mathcal{X}$, the above property is equivalent to saying that for
any fixed $t, t' \in \mathcal{X}$, the variables $k_m(X, t)$ and $k_{m'}(X, t')$ are uncorrelated. This is the
case, for example, if the original input space $\mathcal{X}$ is $\mathbb{R}^M$, the original input variable $X \in \mathcal{X}$ has
independent coordinates, and the kernels $k_1, \dots, k_M$ each act on a different coordinate. Such
a setting was considered in particular by Raskutti et al. (2010) in the setting of $\ell_1$-penalized
MKL. We discuss this assumption in more detail in Section 6.2.

We are now equipped to state our main results: 

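Theorem 5 (Local Rademacher complexity, $p \in [1,2]$). Assume that the kernels are
uniformly bounded ($\|k\|_\infty \le B < \infty$) and that Assumption (U) holds. The local Rademacher
complexity of the multi-kernel class $H_p$ can be bounded for any $1 \le p \le 2$ as

$$\forall t \in [p, 2]: \quad R_r(H_p) \le \sqrt{ \frac{16}{n} \Big\| \Big( \sum_{j=1}^\infty \min\big( r M^{1 - \frac{2}{t^*}},\ c e D^2 t^{*2} \lambda_j^{(m)} \big) \Big)_{m=1}^M \Big\|_{\frac{t^*}{2}} } + \frac{\sqrt{Be}\, D M^{\frac{1}{t^*}} t^*}{n}.$$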

Theorem 6 (Local Rademacher complexity, $p \ge 2$). The local Rademacher complexity of
the multi-kernel class $H_p$ can be bounded for any $p \ge 2$ as

$$R_r(H_p) \le \sqrt{ \frac{2}{n} \sum_{j=1}^\infty \min\big( r,\ D^2 M^{\frac{2}{p^*} - 1} \lambda_j \big) }.$$

Remark 1. Note that for the case $p = 1$, by using $t = (\log(M))^*$ in Theorem 5, we obtain
the bound

$$R_r(H_1) \le \sqrt{ \frac{16}{n} \Big\| \Big( \sum_{j=1}^\infty \min\big( r M,\ e^3 D^2 (\log M)^2 \lambda_j^{(m)} \big) \Big)_{m=1}^M \Big\|_\infty } + \frac{\sqrt{B}\, e^{\frac{3}{2}} D \log(M)}{n}.$$

(See below after the proof of Theorem 5 for a detailed justification.)



Remark 2. The result of Theorem 6 for $p \ge 2$ can be proved using considerably
simpler techniques and without imposing assumptions on boundedness nor
on uncorrelation of the kernels. If in addition the variables $(x^{(m)})$ are centered
and uncorrelated, then the spectra are related as follows: $\mathrm{spec}(J) = \bigcup_{m=1}^M \mathrm{spec}(J_m)$;
that is, $\{\lambda_i, i \ge 1\} = \bigcup_{m=1}^M \{\lambda_i^{(m)}, i \ge 1\}$. Then one can write
equivalently the bound of Theorem 6 as

$$R_r(H_p) \le \sqrt{ \frac{2}{n} \sum_{m=1}^M \sum_{j=1}^\infty \min\big( r,\ D^2 M^{\frac{2}{p^*} - 1} \lambda_j^{(m)} \big) } = \sqrt{ \frac{2}{n} \Big\| \Big( \sum_{j=1}^\infty \min\big( r,\ D^2 M^{\frac{2}{p^*} - 1} \lambda_j^{(m)} \big) \Big)_{m=1}^M \Big\|_1 }.$$

However, the main intended focus of this paper
is on the more challenging case $1 \le p \le 2$ which is usually studied in multiple kernel
learning and relevant in practice.



Remark 3. It is interesting to compare the above bounds for the special case $p = 2$
with the ones of Bartlett et al. (2005). The main term of the bound of Theorem 5 (taking
$t = p = 2$) is then essentially determined by $O\Big( \sqrt{ \frac{1}{n} \sum_{m=1}^M \sum_{j=1}^\infty \min\big( r, \lambda_j^{(m)} \big) } \Big)$. If the
variables $(x^{(m)})$ are centered and uncorrelated, by the relation between the spectra stated
in Remark 2, this is equivalently of order $O\Big( \sqrt{ \frac{1}{n} \sum_{j=1}^\infty \min( r, \lambda_j ) } \Big)$, which is also what we
obtain through Theorem 6, and coincides with the rate shown in Bartlett et al. (2005).

Proof of Theorem 5 and Remark 1. The proof is based on first relating the complexity 
of the class Hp with its centered counterpart, i.e., where all functions fw G Hp are centered 
around their expected value. Then we compute the complexity of the centered class by 
decomposing the complexity into blocks, applying the no-correlation assumption, and using 
the inequalities of Holder and Rosenthal. Then we relate it back to the original class, which 
we in the final step relate to a bound involving the truncation of the particular spectra of 
the kernels. Note that it suffices to prove the result for t = p as trivially R{Hp) < R{Ht) 
for all p < t. 

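Step 1: Relating the original class with the centered class. In order to
exploit the no-correlation assumption, we will work in large parts of the proof with the
centered class $\tilde{H}_p = \{ \tilde{f}_w \mid \|w\|_{2,p} \le D \}$, wherein $\tilde{f}_w : x \mapsto \langle w, \tilde{x} \rangle$, and $\tilde{x} := x - \mathbb{E}x$. We
start the proof by noting that $\tilde{f}_w(x) = f_w(x) - \langle w, \mathbb{E}x \rangle = f_w(x) - \mathbb{E}\langle w, x \rangle = f_w(x) - \mathbb{E} f_w(x)$,
so that, by the bias-variance decomposition, it holds that

$$P f_w^2 = \mathbb{E} f_w(x)^2 = \mathbb{E}\big( f_w(x) - \mathbb{E} f_w(x) \big)^2 + \big( \mathbb{E} f_w(x) \big)^2 = P \tilde{f}_w^2 + (P f_w)^2. \tag{3}$$

Furthermore we note that by Jensen's inequality

$$\|\mathbb{E}x\|_{2,p^*} = \Big( \sum_{m=1}^M \big\| \mathbb{E}x^{(m)} \big\|_2^{p^*} \Big)^{\frac{1}{p^*}} = \Big( \sum_{m=1}^M \big\langle \mathbb{E}x^{(m)}, \mathbb{E}x^{(m)} \big\rangle^{\frac{p^*}{2}} \Big)^{\frac{1}{p^*}} \overset{\text{Jensen}}{\le} \Big( \sum_{m=1}^M \big( \mathbb{E}\big\langle x^{(m)}, x^{(m)} \big\rangle \big)^{\frac{p^*}{2}} \Big)^{\frac{1}{p^*}} = \sqrt{ \Big\| \big( \mathrm{tr}(J_m) \big)_{m=1}^M \Big\|_{\frac{p^*}{2}} }, \tag{4}$$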

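so that we can relate the complexity of the original class to that of the centered one
as follows:

$$\begin{aligned}
R_r(H_p) &= \mathbb{E} \sup_{f_w \in H_p,\ P f_w^2 \le r} \Big\langle w, \frac{1}{n}\sum_{i=1}^n \sigma_i x_i \Big\rangle \\
&\le \mathbb{E} \sup_{f_w \in H_p,\ P f_w^2 \le r} \Big\langle w, \frac{1}{n}\sum_{i=1}^n \sigma_i \tilde{x}_i \Big\rangle + \mathbb{E} \sup_{f_w \in H_p,\ P f_w^2 \le r} \Big\langle w, \frac{1}{n}\sum_{i=1}^n \sigma_i \mathbb{E}x \Big\rangle.
\end{aligned}$$

Concerning the first term of the above upper bound, using (3) we have $P \tilde{f}_w^2 \le P f_w^2$, and
thus

$$\mathbb{E} \sup_{f_w \in H_p,\ P f_w^2 \le r} \Big\langle w, \frac{1}{n}\sum_{i=1}^n \sigma_i \tilde{x}_i \Big\rangle \le \mathbb{E} \sup_{f_w \in H_p,\ P \tilde{f}_w^2 \le r} \Big\langle w, \frac{1}{n}\sum_{i=1}^n \sigma_i \tilde{x}_i \Big\rangle = R_r(\tilde{H}_p).$$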

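Now to bound the second term, we write

$$\begin{aligned}
\mathbb{E} \sup_{f_w \in H_p,\ P f_w^2 \le r} \Big\langle w, \frac{1}{n}\sum_{i=1}^n \sigma_i \mathbb{E}x \Big\rangle
&= \mathbb{E} \Big| \frac{1}{n}\sum_{i=1}^n \sigma_i \Big| \sup_{f_w \in H_p,\ P f_w^2 \le r} \langle w, \mathbb{E}x \rangle \\
&\le \sup_{f_w \in H_p,\ P f_w^2 \le r} \langle w, \mathbb{E}x \rangle \Big( \mathbb{E} \Big( \frac{1}{n}\sum_{i=1}^n \sigma_i \Big)^2 \Big)^{\frac{1}{2}} \\
&= n^{-\frac{1}{2}} \sup_{f_w \in H_p,\ P f_w^2 \le r} \langle w, \mathbb{E}x \rangle.
\end{aligned}$$

Now observe finally that we have

$$\langle w, \mathbb{E}x \rangle \overset{\text{Hölder}}{\le} \|w\|_{2,p} \|\mathbb{E}x\|_{2,p^*} \overset{(4)}{\le} \|w\|_{2,p} \sqrt{ \Big\| \big( \mathrm{tr}(J_m) \big)_{m=1}^M \Big\|_{\frac{p^*}{2}} }$$

as well as

$$\langle w, \mathbb{E}x \rangle = \mathbb{E} f_w(x) \le \sqrt{P f_w^2}.$$

We finally obtain, putting together the steps above,

$$R_r(H_p) \le R_r(\tilde{H}_p) + n^{-\frac{1}{2}} \min\Big( \sqrt{r},\ D \sqrt{ \Big\| \big( \mathrm{tr}(J_m) \big)_{m=1}^M \Big\|_{\frac{p^*}{2}} } \Big). \tag{5}$$

This shows that, at the expense of the additional summand on the right-hand side, we
can work with the centered class instead of the uncentered one.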



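Step 2: Bounding the complexity of the centered class. Since the (centered)
covariance operator $\mathbb{E}\, \tilde{x}^{(m)} \otimes \tilde{x}^{(m)}$ is also a self-adjoint Hilbert-Schmidt operator on $\mathcal{H}_m$,
there exists an eigendecomposition

$$\mathbb{E}\, \tilde{x}^{(m)} \otimes \tilde{x}^{(m)} = \sum_{j=1}^\infty \tilde{\lambda}_j^{(m)}\, \tilde{u}_j^{(m)} \otimes \tilde{u}_j^{(m)}, \tag{6}$$

wherein $(\tilde{u}_j^{(m)})_{j \ge 1}$ is an orthonormal basis of $\mathcal{H}_m$. Furthermore, the no-correlation assumption
(U) entails $\mathbb{E}\, \tilde{x}^{(l)} \otimes \tilde{x}^{(m)} = 0$ for all $l \ne m$. As a consequence,

$$\begin{aligned}
P \tilde{f}_w^2 = \mathbb{E}(\tilde{f}_w(x))^2 &= \mathbb{E}\Big( \sum_{m=1}^M \big\langle w^{(m)}, \tilde{x}^{(m)} \big\rangle \Big)^2 = \sum_{l,m=1}^M \big\langle w^{(l)}, \big( \mathbb{E}\, \tilde{x}^{(l)} \otimes \tilde{x}^{(m)} \big) w^{(m)} \big\rangle \\
&= \sum_{m=1}^M \big\langle w^{(m)}, \big( \mathbb{E}\, \tilde{x}^{(m)} \otimes \tilde{x}^{(m)} \big) w^{(m)} \big\rangle = \sum_{m=1}^M \sum_{j=1}^\infty \tilde{\lambda}_j^{(m)} \big\langle w^{(m)}, \tilde{u}_j^{(m)} \big\rangle^2
\end{aligned} \tag{7}$$

and, for all $j$ and $m$,

$$\begin{aligned}
\mathbb{E} \Big\langle \frac{1}{n}\sum_{i=1}^n \sigma_i \tilde{x}_i^{(m)}, \tilde{u}_j^{(m)} \Big\rangle^2
&= \mathbb{E}\, \frac{1}{n^2} \sum_{i,l=1}^n \sigma_i \sigma_l \big\langle \tilde{x}_i^{(m)}, \tilde{u}_j^{(m)} \big\rangle \big\langle \tilde{x}_l^{(m)}, \tilde{u}_j^{(m)} \big\rangle
= \mathbb{E}\, \frac{1}{n^2} \sum_{i=1}^n \big\langle \tilde{x}_i^{(m)}, \tilde{u}_j^{(m)} \big\rangle^2 \\
&= \frac{1}{n} \Big\langle \tilde{u}_j^{(m)}, \Big( \frac{1}{n}\sum_{i=1}^n \mathbb{E}\, \tilde{x}_i^{(m)} \otimes \tilde{x}_i^{(m)} \Big) \tilde{u}_j^{(m)} \Big\rangle = \frac{\tilde{\lambda}_j^{(m)}}{n}.
\end{aligned} \tag{8}$$

Let now $h_1, \dots, h_M$ be arbitrary nonnegative integers. We can express the local Rademacher
complexity in terms of the eigendecomposition (6) as follows:

$$\begin{aligned}
R_r(\tilde{H}_p) &= \mathbb{E} \sup_{f_w \in \tilde{H}_p :\, P \tilde{f}_w^2 \le r} \Big\langle w, \frac{1}{n}\sum_{i=1}^n \sigma_i \tilde{x}_i \Big\rangle \\
&\le \mathbb{E} \sup_{P \tilde{f}_w^2 \le r} \Big\langle \Big( \sum_{j=1}^{h_m} \sqrt{\tilde{\lambda}_j^{(m)}} \big\langle w^{(m)}, \tilde{u}_j^{(m)} \big\rangle \tilde{u}_j^{(m)} \Big)_{m=1}^M, \Big( \sum_{j=1}^{h_m} \sqrt{\tilde{\lambda}_j^{(m)}}^{\,-1} \Big\langle \frac{1}{n}\sum_{i=1}^n \sigma_i \tilde{x}_i^{(m)}, \tilde{u}_j^{(m)} \Big\rangle \tilde{u}_j^{(m)} \Big)_{m=1}^M \Big\rangle \\
&\quad + \mathbb{E} \sup_{f_w \in \tilde{H}_p} \Big\langle w, \Big( \sum_{j=h_m+1}^\infty \Big\langle \frac{1}{n}\sum_{i=1}^n \sigma_i \tilde{x}_i^{(m)}, \tilde{u}_j^{(m)} \Big\rangle \tilde{u}_j^{(m)} \Big)_{m=1}^M \Big\rangle \\
&\overset{\text{C.-S., Jensen}}{\le} \sup_{P \tilde{f}_w^2 \le r} \Big[ \Big( \sum_{m=1}^M \sum_{j=1}^{h_m} \tilde{\lambda}_j^{(m)} \big\langle w^{(m)}, \tilde{u}_j^{(m)} \big\rangle^2 \Big)^{\frac{1}{2}} \Big( \sum_{m=1}^M \sum_{j=1}^{h_m} \big( \tilde{\lambda}_j^{(m)} \big)^{-1} \mathbb{E} \Big\langle \frac{1}{n}\sum_{i=1}^n \sigma_i \tilde{x}_i^{(m)}, \tilde{u}_j^{(m)} \Big\rangle^2 \Big)^{\frac{1}{2}} \Big] \\
&\quad + \mathbb{E} \sup_{f_w \in \tilde{H}_p} \Big\langle w, \Big( \sum_{j=h_m+1}^\infty \Big\langle \frac{1}{n}\sum_{i=1}^n \sigma_i \tilde{x}_i^{(m)}, \tilde{u}_j^{(m)} \Big\rangle \tilde{u}_j^{(m)} \Big)_{m=1}^M \Big\rangle,
\end{aligned}$$

so that (7) and (8) yield

$$\begin{aligned}
R_r(\tilde{H}_p) &\le \sqrt{ \frac{r \sum_{m=1}^M h_m}{n} } + \mathbb{E} \sup_{f_w \in \tilde{H}_p} \Big\langle w, \Big( \sum_{j=h_m+1}^\infty \Big\langle \frac{1}{n}\sum_{i=1}^n \sigma_i \tilde{x}_i^{(m)}, \tilde{u}_j^{(m)} \Big\rangle \tilde{u}_j^{(m)} \Big)_{m=1}^M \Big\rangle \\
&\overset{\text{Hölder}}{\le} \sqrt{ \frac{r \sum_{m=1}^M h_m}{n} } + D\, \mathbb{E} \Big\| \Big( \sum_{j=h_m+1}^\infty \Big\langle \frac{1}{n}\sum_{i=1}^n \sigma_i \tilde{x}_i^{(m)}, \tilde{u}_j^{(m)} \Big\rangle \tilde{u}_j^{(m)} \Big)_{m=1}^M \Big\|_{2,p^*}.
\end{aligned}$$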

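Step 3: Khintchine-Kahane's and Rosenthal's inequalities. We can now use
the Khintchine-Kahane (K.-K.) inequality (see Lemma 1) to further bound
the right term in the above expression as follows:

$$\begin{aligned}
\mathbb{E} \Big\| \Big( \sum_{j=h_m+1}^\infty \Big\langle \frac{1}{n}\sum_{i=1}^n \sigma_i \tilde{x}_i^{(m)}, \tilde{u}_j^{(m)} \Big\rangle \tilde{u}_j^{(m)} \Big)_{m=1}^M \Big\|_{2,p^*}
&\overset{\text{Jensen}}{\le} \mathbb{E} \Big( \sum_{m=1}^M \mathbb{E}_\sigma \Big\| \sum_{j=h_m+1}^\infty \Big\langle \frac{1}{n}\sum_{i=1}^n \sigma_i \tilde{x}_i^{(m)}, \tilde{u}_j^{(m)} \Big\rangle \tilde{u}_j^{(m)} \Big\|_2^{p^*} \Big)^{\frac{1}{p^*}} \\
&\overset{\text{K.-K.}}{\le} \sqrt{\frac{p^*}{n}}\, \mathbb{E} \Big( \sum_{m=1}^M \Big( \sum_{j=h_m+1}^\infty \frac{1}{n}\sum_{i=1}^n \big\langle \tilde{x}_i^{(m)}, \tilde{u}_j^{(m)} \big\rangle^2 \Big)^{\frac{p^*}{2}} \Big)^{\frac{1}{p^*}} \\
&\overset{\text{Jensen}}{\le} \sqrt{\frac{p^*}{n}} \Big( \sum_{m=1}^M \mathbb{E} \Big( \sum_{j=h_m+1}^\infty \frac{1}{n}\sum_{i=1}^n \big\langle \tilde{x}_i^{(m)}, \tilde{u}_j^{(m)} \big\rangle^2 \Big)^{\frac{p^*}{2}} \Big)^{\frac{1}{p^*}}.
\end{aligned}$$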

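Note that for $p \ge 2$ it holds that $p^*/2 \le 1$, and thus it suffices to employ Jensen's inequality
once again in order to move the expectation operator inside the inner term. In the general
case we need a handle on the $\frac{p^*}{2}$-th moments and to this end employ Lemma 3 (Rosenthal
+ Young), which yields

$$\begin{aligned}
\Big( \sum_{m=1}^M \mathbb{E} \Big( \sum_{j=h_m+1}^\infty \frac{1}{n}\sum_{i=1}^n \big\langle \tilde{x}_i^{(m)}, \tilde{u}_j^{(m)} \big\rangle^2 \Big)^{\frac{p^*}{2}} \Big)^{\frac{1}{p^*}}
&\overset{\text{R+Y}}{\le} \Big( \sum_{m=1}^M (e p^*)^{\frac{p^*}{2}} \Big( \Big(\frac{B}{n}\Big)^{\frac{p^*}{2}} + \Big( \sum_{j=h_m+1}^\infty \underbrace{ \frac{1}{n}\sum_{i=1}^n \mathbb{E} \big\langle \tilde{x}_i^{(m)}, \tilde{u}_j^{(m)} \big\rangle^2 }_{= \tilde{\lambda}_j^{(m)}} \Big)^{\frac{p^*}{2}} \Big) \Big)^{\frac{1}{p^*}} \\
&\overset{(*)}{\le} \sqrt{ e p^* \Big( \frac{B M^{\frac{2}{p^*}}}{n} + \Big( \sum_{m=1}^M \Big( \sum_{j=h_m+1}^\infty \tilde{\lambda}_j^{(m)} \Big)^{\frac{p^*}{2}} \Big)^{\frac{2}{p^*}} \Big) } \\
&\le \sqrt{ e p^* \Big( \frac{B M^{\frac{2}{p^*}}}{n} + \Big\| \Big( \sum_{j=h_m+1}^\infty \lambda_j^{(m)} \Big)_{m=1}^M \Big\|_{\frac{p^*}{2}} \Big) },
\end{aligned}$$

where for (*) we used the subadditivity of $x \mapsto x^{2/p^*}$ (valid since $p^* \ge 2$) and in the last step we applied the
Lidskii-Mirsky-Wielandt theorem which gives $\forall j, m : \tilde{\lambda}_j^{(m)} \le \lambda_j^{(m)}$. Thus by the subadditivity of
the root function

$$\begin{aligned}
R_r(\tilde{H}_p) &\le \sqrt{ \frac{r \sum_{m=1}^M h_m}{n} } + D \sqrt{ \frac{e p^{*2}}{n} \Big( \frac{B M^{\frac{2}{p^*}}}{n} + \Big\| \Big( \sum_{j=h_m+1}^\infty \lambda_j^{(m)} \Big)_{m=1}^M \Big\|_{\frac{p^*}{2}} \Big) } \\
&\le \sqrt{ \frac{r \sum_{m=1}^M h_m}{n} } + \sqrt{ \frac{e p^{*2} D^2}{n} \Big\| \Big( \sum_{j=h_m+1}^\infty \lambda_j^{(m)} \Big)_{m=1}^M \Big\|_{\frac{p^*}{2}} } + \frac{\sqrt{Be}\, D M^{\frac{1}{p^*}} p^*}{n}.
\end{aligned} \tag{9}$$

Step 4: Bounding the complexity of the original class. Now note that for all
nonnegative integers $h_m$ we either have

$$n^{-\frac{1}{2}} \min\Big( \sqrt{r},\ D \sqrt{ \Big\| \big( \mathrm{tr}(J_m) \big)_{m=1}^M \Big\|_{\frac{p^*}{2}} } \Big) \le \sqrt{ \frac{e p^{*2} D^2}{n} \Big\| \Big( \sum_{j=h_m+1}^\infty \lambda_j^{(m)} \Big)_{m=1}^M \Big\|_{\frac{p^*}{2}} }$$

(in case all $h_m$ are zero) or it holds

$$n^{-\frac{1}{2}} \min\Big( \sqrt{r},\ D \sqrt{ \Big\| \big( \mathrm{tr}(J_m) \big)_{m=1}^M \Big\|_{\frac{p^*}{2}} } \Big) \le \sqrt{ \frac{r \sum_{m=1}^M h_m}{n} }$$

(in case that at least one $h_m$ is nonzero) so that in any case we get

$$n^{-\frac{1}{2}} \min\Big( \sqrt{r},\ D \sqrt{ \Big\| \big( \mathrm{tr}(J_m) \big)_{m=1}^M \Big\|_{\frac{p^*}{2}} } \Big) \le \sqrt{ \frac{r \sum_{m=1}^M h_m}{n} } + \sqrt{ \frac{e p^{*2} D^2}{n} \Big\| \Big( \sum_{j=h_m+1}^\infty \lambda_j^{(m)} \Big)_{m=1}^M \Big\|_{\frac{p^*}{2}} }. \tag{10}$$

Thus the following preliminary bound follows from (5) by (9) and (10):

$$R_r(H_p) \le \sqrt{ \frac{4 r \sum_{m=1}^M h_m}{n} } + \sqrt{ \frac{4 e p^{*2} D^2}{n} \Big\| \Big( \sum_{j=h_m+1}^\infty \lambda_j^{(m)} \Big)_{m=1}^M \Big\|_{\frac{p^*}{2}} } + \frac{\sqrt{Be}\, D M^{\frac{1}{p^*}} p^*}{n} \tag{11}$$

for all nonnegative integers $h_m \ge 0$. We could stop here as the above bound is already
the one that will be used in the subsequent section for the computation of the excess loss
bounds. However, we can work a little more on the form of the above bound to gain more
insight into its properties: we will show that it is related to the truncation of the spectra at
the scale $r$.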

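Step 5: Relating the bound to the truncation of the spectra of the kernels.
To this end, notice that for all nonnegative real numbers $A_1, A_2$ and any $a_1, a_2 \in \mathbb{R}_+^M$ it
holds for all $q \ge 1$

$$\sqrt{A_1} + \sqrt{A_2} \le \sqrt{2(A_1 + A_2)} \tag{12}$$

$$\|a_1\|_q + \|a_2\|_q \le 2^{1 - \frac{1}{q}} \|a_1 + a_2\|_q \le 2 \|a_1 + a_2\|_q \tag{13}$$

(the first statement follows from the concavity of the square root function and the second
one is proved in Appendix A; see Lemma 14) and thus

$$\begin{aligned}
R_r(H_p) &\overset{(12)}{\le} \sqrt{ \frac{8}{n} \Big( r \sum_{m=1}^M h_m + e p^{*2} D^2 \Big\| \Big( \sum_{j=h_m+1}^\infty \lambda_j^{(m)} \Big)_{m=1}^M \Big\|_{\frac{p^*}{2}} \Big) } + \frac{\sqrt{Be}\, D M^{\frac{1}{p^*}} p^*}{n} \\
&\overset{\ell_1\text{-to-}\ell_{p^*/2}}{\le} \sqrt{ \frac{8}{n} \Big( r M^{1 - \frac{2}{p^*}} \big\| (h_m)_{m=1}^M \big\|_{\frac{p^*}{2}} + e p^{*2} D^2 \Big\| \Big( \sum_{j=h_m+1}^\infty \lambda_j^{(m)} \Big)_{m=1}^M \Big\|_{\frac{p^*}{2}} \Big) } + \frac{\sqrt{Be}\, D M^{\frac{1}{p^*}} p^*}{n} \\
&\overset{(13)}{\le} \sqrt{ \frac{16}{n} \Big\| \Big( r M^{1 - \frac{2}{p^*}} h_m + e p^{*2} D^2 \sum_{j=h_m+1}^\infty \lambda_j^{(m)} \Big)_{m=1}^M \Big\|_{\frac{p^*}{2}} } + \frac{\sqrt{Be}\, D M^{\frac{1}{p^*}} p^*}{n},
\end{aligned}$$

where to obtain the second inequality we applied that for all non-negative $a \in \mathbb{R}^M$ and
$0 < q < p \le \infty$ it holds¹

$$(\ell_q\text{-to-}\ell_p \text{ conversion}) \qquad \|a\|_q = \langle \mathbf{1}, a^q \rangle^{\frac{1}{q}} \overset{\text{Hölder}}{\le} \big( \|\mathbf{1}\|_{(p/q)^*} \|a^q\|_{p/q} \big)^{\frac{1}{q}} = M^{\frac{1}{q} - \frac{1}{p}} \|a\|_p. \tag{14}$$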

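Since the above holds for all nonnegative integers $h_m$, it follows

$$\begin{aligned}
R_r(H_p) &\le \sqrt{ \frac{16}{n} \Big\| \Big( \min_{h_m \ge 0}\ r M^{1 - \frac{2}{p^*}} h_m + e p^{*2} D^2 \sum_{j=h_m+1}^\infty \lambda_j^{(m)} \Big)_{m=1}^M \Big\|_{\frac{p^*}{2}} } + \frac{\sqrt{Be}\, D M^{\frac{1}{p^*}} p^*}{n} \\
&= \sqrt{ \frac{16}{n} \Big\| \Big( \sum_{j=1}^\infty \min\big( r M^{1 - \frac{2}{p^*}},\ e p^{*2} D^2 \lambda_j^{(m)} \big) \Big)_{m=1}^M \Big\|_{\frac{p^*}{2}} } + \frac{\sqrt{Be}\, D M^{\frac{1}{p^*}} p^*}{n},
\end{aligned}$$

which completes the proof of the theorem.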

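Proof of the remark. To see that Remark 1 holds notice that $R_r(H_1) \le R_r(H_p)$ for all
$p \ge 1$ and thus by choosing $p = (\log(M))^*$ the above bound implies

$$\begin{aligned}
R_r(H_1) &\le \sqrt{ \frac{16}{n} \Big\| \Big( \sum_{j=1}^\infty \min\big( r M^{1 - \frac{2}{p^*}},\ e p^{*2} D^2 \lambda_j^{(m)} \big) \Big)_{m=1}^M \Big\|_{\frac{p^*}{2}} } + \frac{\sqrt{Be}\, D M^{\frac{1}{p^*}} p^*}{n} \\
&\overset{\ell_{p^*/2}\text{-to-}\ell_\infty}{\le} \sqrt{ \frac{16}{n} \Big\| \Big( \sum_{j=1}^\infty \min\big( r M,\ e p^{*2} M^{\frac{2}{p^*}} D^2 \lambda_j^{(m)} \big) \Big)_{m=1}^M \Big\|_\infty } + \frac{\sqrt{Be}\, D M^{\frac{1}{p^*}} p^*}{n} \\
&= \sqrt{ \frac{16}{n} \Big\| \Big( \sum_{j=1}^\infty \min\big( r M,\ e^3 D^2 (\log M)^2 \lambda_j^{(m)} \big) \Big)_{m=1}^M \Big\|_\infty } + \frac{\sqrt{B}\, e^{\frac{3}{2}} D \log(M)}{n},
\end{aligned}$$

which completes the proof.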



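1. We denote by $a^q$ the vector with entries $a_i^q$ and by $\mathbf{1}$ the vector with entries all 1.

Proof of Theorem 6. The eigendecomposition $\mathbb{E}\, x \otimes x = \sum_{j=1}^\infty \lambda_j u_j \otimes u_j$ yields

$$P f_w^2 = \mathbb{E}(f_w(x))^2 = \mathbb{E}\langle w, x \rangle^2 = \langle w, (\mathbb{E}\, x \otimes x) w \rangle = \sum_{j=1}^\infty \lambda_j \langle w, u_j \rangle^2, \tag{15}$$

and, for all $j$

$$\mathbb{E} \Big\langle \frac{1}{n}\sum_{i=1}^n \sigma_i x_i, u_j \Big\rangle^2 = \mathbb{E}\, \frac{1}{n^2} \sum_{i,l=1}^n \sigma_i \sigma_l \langle x_i, u_j \rangle \langle x_l, u_j \rangle = \mathbb{E}\, \frac{1}{n^2} \sum_{i=1}^n \langle x_i, u_j \rangle^2 = \frac{1}{n} \Big\langle u_j, \Big( \frac{1}{n}\sum_{i=1}^n \mathbb{E}\, x_i \otimes x_i \Big) u_j \Big\rangle = \frac{\lambda_j}{n}. \tag{16}$$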
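Therefore, we can use, for any nonnegative integer $h$, the Cauchy-Schwarz inequality
and a block-structured version of Hölder's inequality (see Lemma 12) to bound the local
Rademacher complexity as follows:

$$\begin{aligned}
R_r(H_p) &= \mathbb{E} \sup_{f_w \in H_p :\, P f_w^2 \le r} \Big\langle w, \frac{1}{n}\sum_{i=1}^n \sigma_i x_i \Big\rangle \\
&= \mathbb{E} \sup_{f_w \in H_p :\, P f_w^2 \le r} \Big[ \Big\langle \sum_{j=1}^h \sqrt{\lambda_j} \langle w, u_j \rangle u_j,\ \sum_{j=1}^h \sqrt{\lambda_j}^{\,-1} \Big\langle \frac{1}{n}\sum_{i=1}^n \sigma_i x_i, u_j \Big\rangle u_j \Big\rangle + \Big\langle w, \sum_{j=h+1}^\infty \Big\langle \frac{1}{n}\sum_{i=1}^n \sigma_i x_i, u_j \Big\rangle u_j \Big\rangle \Big] \\
&\overset{\text{C.-S., (15), (16)}}{\le} \sqrt{\frac{rh}{n}} + \mathbb{E} \sup_{f_w \in H_p} \Big\langle w, \sum_{j=h+1}^\infty \Big\langle \frac{1}{n}\sum_{i=1}^n \sigma_i x_i, u_j \Big\rangle u_j \Big\rangle \\
&\overset{\text{Hölder}}{\le} \sqrt{\frac{rh}{n}} + D\, \mathbb{E} \Big\| \sum_{j=h+1}^\infty \Big\langle \frac{1}{n}\sum_{i=1}^n \sigma_i x_i, u_j \Big\rangle u_j \Big\|_{2,p^*} \\
&\overset{\ell_{p^*}\text{-to-}\ell_2}{\le} \sqrt{\frac{rh}{n}} + D M^{\frac{1}{p^*} - \frac{1}{2}}\, \mathbb{E} \Big\| \sum_{j=h+1}^\infty \Big\langle \frac{1}{n}\sum_{i=1}^n \sigma_i x_i, u_j \Big\rangle u_j \Big\|_2 \\
&\overset{\text{Jensen}}{\le} \sqrt{\frac{rh}{n}} + D M^{\frac{1}{p^*} - \frac{1}{2}} \Big( \sum_{j=h+1}^\infty \mathbb{E} \Big\langle \frac{1}{n}\sum_{i=1}^n \sigma_i x_i, u_j \Big\rangle^2 \Big)^{\frac{1}{2}} \\
&\overset{(16)}{\le} \sqrt{\frac{rh}{n}} + \sqrt{ \frac{D^2 M^{\frac{2}{p^*} - 1}}{n} \sum_{j=h+1}^\infty \lambda_j }.
\end{aligned}$$

Since the above holds for all $h$, the result now follows from $\sqrt{A} + \sqrt{B} \le \sqrt{2(A + B)}$ for all
nonnegative real numbers $A, B$ (which holds by the concavity of the square root function):

$$R_r(H_p) \le \sqrt{ \frac{2}{n} \min_{h \ge 0} \Big( r h + D^2 M^{\frac{2}{p^*} - 1} \sum_{j=h+1}^\infty \lambda_j \Big) } = \sqrt{ \frac{2}{n} \sum_{j=1}^\infty \min\big( r,\ D^2 M^{\frac{2}{p^*} - 1} \lambda_j \big) }.$$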
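The last step of the proof rests on the identity $\min_{h \ge 0} \big( r h + c \sum_{j > h} \lambda_j \big) = \sum_j \min(r, c\lambda_j)$ for a non-increasing spectrum, which is what "truncating the spectrum at scale $r$" means. The sketch below checks this numerically; the finite spectrum $\lambda_j = j^{-2}$, the constants, and the helper names are hypothetical choices for illustration only.

```python
import numpy as np

def truncated_form(lam, r, c):
    """min over h >= 0 of r*h + c * sum_{j>h} lam_j, for lam sorted non-increasingly."""
    suffix = np.cumsum(lam[::-1])[::-1]           # suffix[k] = sum of lam[k:]
    tails = c * np.concatenate([suffix, [0.0]])   # tails[h] = c * sum_{j>h} lam_j
    h = np.arange(len(lam) + 1)
    return float(np.min(r * h + tails))

def min_form(lam, r, c):
    """sum_j min(r, c * lam_j)."""
    return float(np.minimum(r, c * lam).sum())

# Hypothetical setting: polynomially decaying spectrum, p = 3 so that p* = 3/2.
n, M, D, r = 1000, 10, 1.0, 0.01
p_star = 1.5
lam = 1.0 / np.arange(1, 1001) ** 2
c = D ** 2 * M ** (2.0 / p_star - 1.0)

lhs = truncated_form(lam, r, c)
rhs = min_form(lam, r, c)
bound = np.sqrt(2.0 / n * rhs)                    # right-hand side of Theorem 6
```

The optimal $h$ is the number of eigenvalues with $c\lambda_j > r$: those directions each contribute $r$, and the remaining tail contributes its scaled eigenvalue mass.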



4. Lower Bound 

In this section we investigate the tightness of our bound on the local Rademacher
complexity of $H_p$. To derive a lower bound we consider the particular case where variables
$x^{(1)}, \dots, x^{(M)}$ are i.i.d. For example, this happens if the original input space $\mathcal{X}$ is $\mathbb{R}^M$, the
original input variable $X \in \mathcal{X}$ has i.i.d. coordinates, and the kernels $k_1, \dots, k_M$ are identical
and each act on a different coordinate of $X$.

Lemma 7. Assume that the variables $x^{(1)}, \dots, x^{(M)}$ are centered and independently and
identically distributed. Then, the following lower bound holds for the local Rademacher
complexity of $H_p$ for any $p \ge 1$:

$$R_r(H_{p,D,M}) \ge R_{rM}\big( H_{1, D M^{1/p^*}, 1} \big).$$

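Proof. First note that since the $x^{(m)}$ are centered and uncorrelated,

$$P f_w^2 = \mathbb{E} \Big( \sum_{m=1}^M \big\langle w^{(m)}, x^{(m)} \big\rangle \Big)^2 = \sum_{m=1}^M \mathbb{E} \big\langle w^{(m)}, x^{(m)} \big\rangle^2.$$

Now it follows

$$\begin{aligned}
R_r(H_{p,D,M}) &= \mathbb{E} \sup_{w :\ P f_w^2 \le r,\ \|w\|_{2,p} \le D} \Big\langle w, \frac{1}{n}\sum_{i=1}^n \sigma_i x_i \Big\rangle \\
&\ge \mathbb{E} \sup_{w :\ \forall m:\ \mathbb{E}\langle w^{(m)}, x^{(m)} \rangle^2 \le r/M,\ \|w^{(m)}\|_2 \le D M^{-1/p}} \ \sum_{m=1}^M \Big\langle w^{(m)}, \frac{1}{n}\sum_{i=1}^n \sigma_i x_i^{(m)} \Big\rangle \\
&= \sum_{m=1}^M \mathbb{E} \sup_{w^{(m)} :\ \mathbb{E}\langle w^{(m)}, x^{(m)} \rangle^2 \le r/M,\ \|w^{(m)}\|_2 \le D M^{-1/p}} \Big\langle w^{(m)}, \frac{1}{n}\sum_{i=1}^n \sigma_i x_i^{(m)} \Big\rangle,
\end{aligned}$$

where the inequality holds because the constraints on the individual blocks imply $P f_w^2 \le r$ and $\|w\|_{2,p} \le D$,
so that we can use the i.i.d. assumption on $x^{(m)}$ to equivalently rewrite the last term as

$$\begin{aligned}
R_r(H_{p,D,M}) &\ge \mathbb{E} \sup_{w^{(1)} :\ \mathbb{E}\langle w^{(1)}, x^{(1)} \rangle^2 \le r/M,\ \|w^{(1)}\|_2 \le D M^{-1/p}} \Big\langle M w^{(1)}, \frac{1}{n}\sum_{i=1}^n \sigma_i x_i^{(1)} \Big\rangle \\
&= \mathbb{E} \sup_{w^{(1)} :\ \mathbb{E}\langle M w^{(1)}, x^{(1)} \rangle^2 \le r M,\ \|M w^{(1)}\|_2 \le D M^{1/p^*}} \Big\langle M w^{(1)}, \frac{1}{n}\sum_{i=1}^n \sigma_i x_i^{(1)} \Big\rangle \\
&= \mathbb{E} \sup_{w^{(1)} :\ \mathbb{E}\langle w^{(1)}, x^{(1)} \rangle^2 \le r M,\ \|w^{(1)}\|_2 \le D M^{1/p^*}} \Big\langle w^{(1)}, \frac{1}{n}\sum_{i=1}^n \sigma_i x_i^{(1)} \Big\rangle \\
&= R_{rM}\big( H_{1, D M^{1/p^*}, 1} \big). \qquad ∎
\end{aligned}$$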


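In Mendelson (2003) it was shown that there is an absolute constant $c$ so that if $\lambda_1^{(1)} \ge \frac{1}{n}$
then for all $r \ge \frac{1}{n}$ it holds $R_r(H_{1,1,1}) \ge \sqrt{ \frac{c}{n} \sum_{j=1}^\infty \min(r, \lambda_j^{(1)}) }$. Closer inspection of the proof
reveals that more generally it holds $R_r(H_{1,D,1}) \ge \sqrt{ \frac{c}{n} \sum_{j=1}^\infty \min(r, D^2 \lambda_j^{(1)}) }$ if $\lambda_1^{(1)} \ge \frac{1}{n D^2}$,
so that we can use that result together with the previous lemma to obtain:

Theorem 8 (Lower bound). Assume that the kernels are centered and independently and
identically distributed. Then, the following lower bound holds for the local Rademacher
complexity of $H_p$. There is an absolute constant $c$ such that if $\lambda_1^{(1)} \ge \frac{1}{n D^2}$ then for all $r \ge \frac{1}{n}$ and
$p \ge 1$,

$$R_r(H_{p,D,M}) \ge \sqrt{ \frac{c}{n} \sum_{j=1}^\infty \min\big( r M,\ D^2 M^{\frac{2}{p^*}} \lambda_j^{(1)} \big) }. \tag{17}$$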

We would like to compare the above lower bound with the upper bound of Theorem 5. To
this end note that for centered i.i.d. kernels the upper bound reads

$$R_r(H_p) \le \sqrt{ \frac{16}{n} \sum_{j=1}^\infty \min\big( r M,\ c e D^2 p^{*2} M^{\frac{2}{p^*}} \lambda_j^{(1)} \big) } + \frac{\sqrt{Be}\, D M^{\frac{1}{p^*}} p^*}{n},$$

which is of the order $O\Big( \sqrt{ \frac{1}{n} \sum_{j=1}^\infty \min\big( r M, D^2 M^{\frac{2}{p^*}} \lambda_j^{(1)} \big) } \Big)$ and, disregarding the quickly
converging term on the right-hand side and absolute constants, matches the lower bound (17).
A similar comparison can be performed for the upper bound of
Theorem 6: by Remark 2 the bound reads

$$R_r(H_p) \le \sqrt{ \frac{2}{n} \Big\| \Big( \sum_{j=1}^\infty \min\big( r,\ D^2 M^{\frac{2}{p^*} - 1} \lambda_j^{(m)} \big) \Big)_{m=1}^M \Big\|_1 },$$

which for i.i.d. kernels becomes $\sqrt{ \frac{2}{n} \sum_{j=1}^\infty \min\big( r M, D^2 M^{\frac{2}{p^*}} \lambda_j^{(1)} \big) }$ and thus, up to absolute
constants, matches the lower bound. This shows that the upper bounds of the previous
section are tight.

5. Excess Risk Bounds 

In this section we show an application of our results to prediction problems, such as
classification or regression. To this aim, in addition to the data $x_1, \dots, x_n$ introduced earlier in
this paper, let also a label sequence $y_1, \dots, y_n \subset [-1, 1]$ be given that is i.i.d. generated from
a probability distribution. The goal in statistical learning is to find a hypothesis $f$ from
a pregiven class $\mathcal{F}$ that minimizes the expected loss $\mathbb{E}\, l(f(x), y)$, where $l : \mathbb{R}^2 \to [-1, 1]$
is a predefined loss function that encodes the objective of the given learning/prediction
task at hand. For example, the hinge loss $l(t, y) = \max(0, 1 - yt)$ and the squared loss
$l(t, y) = (t - y)^2$ are frequently used in classification and regression problems, respectively.

Since the distribution generating the example/label pairs is unknown, the optimal
decision function

$$f^* := \operatorname*{argmin}_{f} \mathbb{E}\, l(f(x), y)$$

cannot be computed directly and a frequently used method consists of instead minimizing
the empirical loss,

$$\hat{f} := \operatorname*{argmin}_{f} \frac{1}{n}\sum_{i=1}^n l(f(x_i), y_i).$$

In order to evaluate the performance of this so-called empirical minimization algorithm we
study the excess loss,

$$P(l_{\hat{f}} - l_{f^*}) := \mathbb{E}\, l(\hat{f}(x), y) - \mathbb{E}\, l(f^*(x), y).$$

In Bartlett et al. (2005) and Koltchinskii (2006) it was shown that the rate of convergence of
the excess risk is basically determined by the fixed point of the local Rademacher complexity.
For example, the following result is a slight modification of Corollary 5.3 in Bartlett et al.
(2005) that is well-tailored to the class studied in this paper.²

Lemma 9. Let $\mathcal{F}$ be an absolute convex class ranging in the interval $[a,b]$ and let $l$ be a Lipschitz continuous loss with constant $L$. Assume there is a positive constant $F$ such that $\forall f\in\mathcal{F}:\ P(f-f^*)^2 \le F\,P(l_f-l_{f^*})$. Then, denoting by $r^*$ the fixed point of

$$2FL\,R_{\frac{r}{4L^2}}(\mathcal{F}),$$

for all $x>0$ with probability at least $1-e^{-x}$ the excess loss can be bounded as

$$P(l_{\hat f}-l_{f^*}) \le \frac{7r^*}{F} + \frac{(11L(b-a)+27F)\,x}{n}.$$

2. We exploit the improved constants from Theorem 3.3 in Bartlett et al. (2005) because an absolute convex class is star-shaped. Compared to Corollary 5.3 in Bartlett et al. (2005) we also use a slightly more general function class ranging in $[a,b]$ instead of the interval $[-1,1]$. This is also justified by Theorem 3.3.

The above result shows that in order to obtain an excess risk bound on the multi-kernel class $H_p$ it suffices to compute the fixed point of our bound on the local Rademacher complexity presented in Section 3. To this end we show:

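Lemma 10. Assume that $\|k\|_\infty \le B$ almost surely and let $p\in[1,2]$. For the fixed point $r^*$ of the local Rademacher complexity $2FL\,R_{\frac{r}{4L^2}}(H_p)$ it holds

$$r^* \le \min_{0\le h_m\le\infty}\; \frac{4F^2\sum_{m=1}^{M}h_m}{n} + 8FL\sqrt{\frac{e\,p^{*2}D^2}{n}\,\bigg\|\bigg(\sum_{j=h_m+1}^{\infty}\lambda_j^{(m)}\bigg)_{m=1}^{M}\bigg\|_{\frac{p^*}{2}}} + \frac{4\sqrt{Be}\,DFL\,M^{\frac{1}{p^*}}p^*}{n}.$$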

Proof For this proof we make use of the bound (11) on the local Rademacher complexity. Defining

$$a = \frac{4F^2\sum_{m=1}^{M}h_m}{n} \quad\text{and}\quad b = 4FL\sqrt{\frac{e\,p^{*2}D^2}{n}\,\bigg\|\bigg(\sum_{j=h_m+1}^{\infty}\lambda_j^{(m)}\bigg)_{m=1}^{M}\bigg\|_{\frac{p^*}{2}}} + \frac{2\sqrt{Be}\,DFL\,M^{\frac{1}{p^*}}p^*}{n},$$

in order to find a fixed point of (11) we need to solve for $r=\sqrt{ar}+b$, which is equivalent to solving $r^2-(a+2b)r+b^2=0$ for a positive root. Denote this solution by $r^*$. It is then easy to see that $r^*\le a+2b$. Resubstituting the definitions of $a$ and $b$ yields the result. ■
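
The last step of the proof can be checked numerically. The following is a minimal sketch, not part of the original argument: it solves $r=\sqrt{ar}+b$ through the quadratic $r^2-(a+2b)r+b^2=0$ and compares the root with $a+2b$. The function name `fixed_point` and the sample values of `a` and `b` are illustrative assumptions.

```python
import math

def fixed_point(a: float, b: float) -> float:
    """Larger root of r^2 - (a + 2b) r + b^2 = 0, which solves r = sqrt(a r) + b."""
    s = a + 2.0 * b
    disc = s * s - 4.0 * b * b          # equals a * (a + 4b), nonnegative for a, b >= 0
    return 0.5 * (s + math.sqrt(disc))

# Illustrative values only; any a, b >= 0 should behave the same way.
for a, b in [(0.1, 0.01), (1.0, 1.0), (5.0, 0.2)]:
    r = fixed_point(a, b)
    assert abs(r - (math.sqrt(a * r) + b)) < 1e-9   # r is a fixed point
    assert r <= a + 2.0 * b                          # and is bounded by a + 2b
    print(f"a={a}, b={b}: r*={r:.6f}, a+2b={a + 2.0 * b:.6f}")
```

The smaller root of the quadratic is at most $b$ and therefore does not satisfy $r-b=\sqrt{ar}$ unless $a=0$, which is why the sketch takes the larger one.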



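We now address the issue of computing actual rates of convergence of the fixed point $r^*$ under the assumption of algebraically decreasing eigenvalues of the kernel matrices, that is, we assume $\exists d_m:\ \lambda_j^{(m)}\le d_m j^{-\alpha_m}$ for some $\alpha_m>1$. This is a common assumption and, for example, met for finite rank kernels and convolution kernels (Williamson et al., 2001). Notice that this implies

$$\sum_{j=h_m+1}^{\infty}\lambda_j^{(m)} \le d_m\sum_{j=h_m+1}^{\infty}j^{-\alpha_m} \le d_m\int_{h_m}^{\infty}x^{-\alpha_m}\,dx = d_m\bigg[\frac{1}{1-\alpha_m}x^{1-\alpha_m}\bigg]_{h_m}^{\infty} = -\frac{d_m}{1-\alpha_m}h_m^{1-\alpha_m}. \qquad (18)$$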
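
The integral comparison behind (18) is easy to verify numerically. The sketch below is illustrative only: it assumes the extremal spectrum $\lambda_j = d\,j^{-\alpha}$ and compares a long partial sum of the tail with the closed-form bound $\frac{d}{\alpha-1}h^{1-\alpha}$. The helper names and the truncation length are assumptions made for the example.

```python
def tail_sum(d: float, alpha: float, h: int, terms: int = 200_000) -> float:
    """Partial sum of d * j**(-alpha) over j = h+1, ..., h+terms.

    This truncation is a lower estimate of the infinite tail sum.
    """
    return sum(d * j ** (-alpha) for j in range(h + 1, h + 1 + terms))

def integral_bound(d: float, alpha: float, h: int) -> float:
    """Closed form -d/(1-alpha) * h**(1-alpha) = d/(alpha-1) * h**(1-alpha), for h >= 1."""
    return d / (alpha - 1.0) * h ** (1.0 - alpha)

for alpha in (1.5, 2.0, 4.0):
    for h in (1, 10, 100):
        s, bound = tail_sum(1.0, alpha, h), integral_bound(1.0, alpha, h)
        assert s <= bound
        print(f"alpha={alpha}, h={h}: tail >= {s:.5f}, bound = {bound:.5f}")
```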



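To exploit the above fact, first note that by $\ell_p$-to-$\ell_q$ conversion

$$\frac{4F^2\sum_{m=1}^{M}h_m}{n} \le 4F\sqrt{\frac{F^2M\sum_{m=1}^{M}h_m^2}{n^2}} \le 4F\sqrt{\frac{F^2M^{2-\frac{2}{p^*}}\big\|(h_m^2)_{m=1}^{M}\big\|_{\frac{p^*}{2}}}{n^2}}$$

so that we can translate the result of the previous lemma by (12), (13), and (14) into

$$r^* \le \min_{0\le h_m\le\infty}\; 8F\sqrt{\frac{1}{n}\,\bigg\|\bigg(\frac{F^2M^{2-\frac{2}{p^*}}h_m^2}{n} + 4e\,p^{*2}D^2L^2\sum_{j=h_m+1}^{\infty}\lambda_j^{(m)}\bigg)_{m=1}^{M}\bigg\|_{\frac{p^*}{2}}} + \frac{4\sqrt{Be}\,DFL\,M^{\frac{1}{p^*}}p^*}{n}. \qquad (19)$$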
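Inserting the result of (18) into the above bound and setting the derivative with respect to $h_m$ to zero we find the optimal $h_m$ as

$$h_m = \Big(4d_m e\,p^{*2}D^2F^{-2}L^2M^{\frac{2}{p^*}-2}\,n\Big)^{\frac{1}{1+\alpha_m}}.$$

Resubstituting the above into (19) we note that

$$r^* = O\Bigg(\sqrt{\Big\|\Big(n^{-\frac{2\alpha_m}{1+\alpha_m}}\Big)_{m=1}^{M}\Big\|_{\frac{p^*}{2}}}\Bigg)$$

so that we observe that the asymptotic rate of convergence in $n$ is determined by the kernel with the smallest decreasing spectrum (i.e., smallest $\alpha_m$). Denoting $d_{\max} := \max_{m=1,\ldots,M} d_m$, $\alpha_{\min} := \min_{m=1,\ldots,M}\alpha_m$, and $h_{\max} := \big(4d_{\max}e\,p^{*2}D^2F^{-2}L^2M^{\frac{2}{p^*}-2}\,n\big)^{\frac{1}{1+\alpha_{\min}}}$ we can upper-bound (19) by

$$r^* \le 8F\sqrt{\frac{3-\alpha_m}{1-\alpha_m}\,F^2M^2h_{\max}^2n^{-2}} + \frac{4\sqrt{Be}\,DFL\,M^{\frac{1}{p^*}}p^*}{n} \le 8\sqrt{\frac{3-\alpha_m}{1-\alpha_m}}\,F^2Mh_{\max}n^{-1} + \frac{4\sqrt{Be}\,DFL\,M^{\frac{1}{p^*}}p^*}{n}.$$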

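We have thus proved the following theorem, which follows by the above inequality, Lemma 9, and the fact that our class $H_p$ ranges in $BDM^{\frac{1}{p^*}}$.

Theorem 11. Assume that $\|k\|_\infty\le B$ and $\exists d_m:\ \lambda_j^{(m)}\le d_m j^{-\alpha_m}$ for some $\alpha_m>1$. Let $l$ be a Lipschitz continuous loss with constant $L$ and assume there is a positive constant $F$ such that $\forall f\in\mathcal{F}:\ P(f-f^*)^2\le F\,P(l_f-l_{f^*})$. Then for all $x>0$ with probability at least $1-e^{-x}$ the excess loss of the multi-kernel class $H_p$ can be bounded for $p\in[1,2]$ as

$$P(l_{\hat f}-l_{f^*}) \le \min_{t\in[p,2]}\; 186\sqrt{\frac{3-\alpha_m}{1-\alpha_m}}\,\Big(d_{\max}D^2F^{\alpha_{\min}-1}L^2t^{*2}\Big)^{\frac{1}{1+\alpha_{\min}}}M^{1+\frac{2}{1+\alpha_{\min}}\left(\frac{1}{t^*}-1\right)}n^{-\frac{\alpha_{\min}}{1+\alpha_{\min}}} + \frac{47\sqrt{B}\,DLM^{\frac{1}{t^*}}t^*}{n} + \frac{\big(22BDLM^{\frac{1}{t^*}}+27F\big)x}{n}.$$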

We see from the above bound that convergence can be almost as slow as $O\big(p^*M^{\frac{1}{p^*}}n^{-\frac{1}{2}}\big)$ (if at least one $\alpha_m\approx1$ is small and thus $\alpha_{\min}$ is small) and almost as fast as $O(n^{-1})$ (if $\alpha_m$ is large for all $m$ and thus $\alpha_{\min}$ is large). For example, the latter is the case if all kernels have finite rank; the convolution kernel is also an example of this type.

Notice that we of course could repeat the above discussion to obtain excess risk bounds 
for the case p > 2 as well, but since it is very questionable that this will lead to new insights, 
it is omitted for simplicity. 
6. Discussion



In this section we compare the obtained local Rademacher bound with the global one, 
discuss related work as well as the assumption (U), and give a practical application of the 
bounds by studying the appropriateness of small/large p in various learning scenarios. 

6.1 Global vs. Local Rademacher Bounds 

In this section, we discuss the rates obtained from the bound in Theorem 11 for the excess risk and compare them to the rates obtained using the global Rademacher complexity bound of Corollary 4. To simplify the discussion somewhat, we assume that the eigenvalues satisfy $\lambda_j^{(m)}\le d\,j^{-\alpha}$ (with $\alpha>1$) for all $m$ and concentrate on the rates obtained as a function of the parameters $n$, $\alpha$, $M$, $D$ and $p$, while considering other parameters fixed and hiding them in a big-O notation. Using this simplification, the bound of Theorem 11 reads

$$\forall t\in[p,2]:\quad P(l_{\hat f}-l_{f^*}) = O\Big((t^*D)^{\frac{2}{1+\alpha}}M^{1+\frac{2}{1+\alpha}\left(\frac{1}{t^*}-1\right)}n^{-\frac{\alpha}{1+\alpha}}\Big). \qquad (21)$$

On the other hand, the global Rademacher complexity directly leads to a bound on the supremum of the centered empirical process indexed by $\mathcal{F}$ and thus also provides a bound on the excess risk (see, e.g., Bousquet et al., 2004). Therefore, using Corollary 4, wherein we upper bound the trace of each $J_m$ by the constant $B$ (and subsume it under the O-notation), we have a second bound on the excess risk of the form

$$\forall t\in[p,2]:\quad P(l_{\hat f}-l_{f^*}) = O\Big(t^*DM^{\frac{1}{t^*}}n^{-\frac{1}{2}}\Big). \qquad (22)$$

First consider the case where $p\ge(\log M)^*$, that is, the best choice in (21) and (22) is $t=p$. Clearly, if we hold all other parameters fixed and let $n$ grow to infinity, the rate obtained through the local Rademacher analysis is better since $\alpha>1$. However, it is also of interest to consider what happens when the number of kernels $M$ and the $\ell_p$ ball radius $D$ can grow with $n$. In general, we have a bound on the excess risk given by the minimum of (21) and (22); a straightforward calculation shows that the local Rademacher analysis improves over the global one whenever

$$\frac{M^{\frac{1}{p}}}{D} = O(\sqrt{n}).$$

Interestingly, we note that this "phase transition" does not depend on $\alpha$ (i.e. the "complexity" of the individual kernels), but only on $p$.
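
The "straightforward calculation" can be illustrated with a small sketch. It assumes the two rates take the forms $(p^*D)^{\frac{2}{1+\alpha}}M^{1+\frac{2}{1+\alpha}(\frac{1}{p^*}-1)}n^{-\frac{\alpha}{1+\alpha}}$ (local) and $p^*DM^{\frac{1}{p^*}}n^{-\frac{1}{2}}$ (global) with $t=p$ and all hidden constants set to 1. Under that assumption their ratio equals $\big(M^{1/p}/(p^*D\sqrt{n})\big)^{\frac{\alpha-1}{1+\alpha}}$, so the comparison depends on $\alpha$ only through a positive exponent. All function names are hypothetical.

```python
import math

def conjugate(p: float) -> float:
    """Conjugate exponent p* defined by 1/p + 1/p* = 1 (requires p > 1)."""
    return p / (p - 1.0)

def local_rate(n, M, D, p, alpha):
    """Local rate with t = p and the hidden constant set to 1 (an assumption)."""
    ps = conjugate(p)
    e = 2.0 / (1.0 + alpha)
    return (ps * D) ** e * M ** (1.0 + e * (1.0 / ps - 1.0)) * n ** (-alpha / (1.0 + alpha))

def global_rate(n, M, D, p):
    """Global rate with t = p and the hidden constant set to 1 (an assumption)."""
    ps = conjugate(p)
    return ps * D * M ** (1.0 / ps) / math.sqrt(n)

def local_wins(n, M, D, p, alpha):
    return local_rate(n, M, D, p, alpha) <= global_rate(n, M, D, p)

def transition_condition(n, M, D, p):
    """M^(1/p) <= p* D sqrt(n): the phase-transition criterion with constants kept."""
    return M ** (1.0 / p) <= conjugate(p) * D * math.sqrt(n)

# Two regimes far from the boundary; the outcome is the same for every alpha > 1.
for n, M in [(10_000, 100), (10, 10_000)]:
    for alpha in (1.5, 2.0, 5.0):
        assert local_wins(n, M, 1.0, 1.5, alpha) == transition_condition(n, M, 1.0, 1.5)
```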

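If $p\le(\log M)^*$, the best choice in (21) and (22) is $t=(\log M)^*$. In this case taking the minimum of the two bounds reads

$$\forall p\le(\log M)^*:\quad P(l_{\hat f}-l_{f^*}) \le O\Big(\min\Big(D(\log M)\,n^{-\frac{1}{2}},\ (D\log M)^{\frac{2}{1+\alpha}}M^{\frac{\alpha-1}{1+\alpha}}n^{-\frac{\alpha}{1+\alpha}}\Big)\Big), \qquad (23)$$

and the phase transition when the local Rademacher bound improves over the global one occurs for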

$$\frac{M}{D\log M} = O(\sqrt{n}).$$

Finally, it is also interesting to observe the behavior of (21) and (22) as $\alpha\to\infty$. In this case, only one eigenvalue is nonzero for each kernel, that is, each kernel space is one-dimensional. In other words, we are in the case of "classical" aggregation of $M$ basis functions, and the minimum of the two bounds reads

$$\forall t\in[p,2]:\quad P(l_{\hat f}-l_{f^*}) \le O\Big(\min\Big(Mn^{-1},\ t^*DM^{\frac{1}{t^*}}n^{-\frac{1}{2}}\Big)\Big). \qquad (24)$$

In this configuration, observe that the local Rademacher bound is $O(M/n)$ and no longer depends on $D$ or $p$; in fact, it is the same bound that one would obtain for the empirical risk minimization over the space of all linear combinations of the $M$ base functions, without any restriction on the norm of the coefficients: the $\ell_p$-norm constraint becomes void. The global Rademacher bound, on the other hand, still depends crucially on the norm constraint. This situation is to be compared to the sharp analysis of the optimal convergence rate of convex aggregation of $M$ functions obtained by Tsybakov (2003) in the framework of squared error loss regression, which is shown to be

$$O\Bigg(\min\Bigg(\frac{M}{n},\ \sqrt{\frac{1}{n}\log\Big(\frac{M}{\sqrt{n}}\Big)}\Bigg)\Bigg).$$



This corresponds to the setting studied here with $D=1$, $p=1$ and $\alpha\to\infty$, and we see that in this case the bound (23) recovers (up to log factors) this sharp bound and the related phase transition phenomenon.

6.2 Discussion of Assumption (U) 

Assumption (U) is arguably quite a strong hypothesis for the validity of our results (needed for 1 < p < 2), which was not required for the global Rademacher bound. A similar assumption was made in the recent work of Raskutti et al. (2010), where a related MKL algorithm using an $\ell_1$-type penalty is studied, and bounds are derived that depend on the "sparsity pattern" of the Bayes function, i.e. how many coefficients are non-zero. If the kernel spaces are one-dimensional, in which case $\ell_1$-penalized MKL reduces qualitatively to standard lasso-type methods, this assumption can be seen as a strong form of the so-called Restricted Isometry Property (RIP), which is known to be necessary to grant the validity of bounds taking into account the sparsity pattern of the Bayes function.

In the present work, our analysis stays deliberately "agnostic" (or worst-case) with respect to the true sparsity pattern (in part because experimental evidence seems to point towards the fact that the Bayes function is not strongly sparse); correspondingly, it could legitimately be hoped that the RIP condition, or Assumption (U), could be substantially relaxed. Considering again the special case of one-dimensional kernel spaces and the discussion about the qualitatively equivalent case $\alpha\to\infty$ in the previous section, it can be seen that Assumption (U) is indeed unnecessary for bound (24) to hold, and more specifically for the rate of $M/n$ obtained through local Rademacher analysis in this case. However, as we discussed, what happens in this specific case is that the local Rademacher analysis becomes oblivious to the $\ell_p$-norm constraint, and we are left with the standard parametric convergence rate in dimension $M$. In other words, with one-dimensional kernel spaces, the two constraints (on the $L^2(P)$-norm of the function and on the $\ell_p$ block-norm of the coefficients) appearing in the definition of local Rademacher complexity are essentially not active simultaneously. Unfortunately, it is clear that this property is not true anymore for kernels of higher complexity (i.e. with a non-trivial decay rate of the eigenvalues). This is a specificity of the kernel setting as compared to combinations of a dictionary of $M$ simple functions, and Assumption (U) was in effect used to "align" the two constraints. To sum up, Assumption (U) is used here for a different purpose from that of the RIP in sparsity analyses of $\ell_1$ regularization methods; it is not clear to us at this point if this assumption is necessary or if uncorrelated variables $x^{(m)}$ constitute a "worst case" for our analysis. We did not succeed so far in relinquishing this assumption for $p<2$, and this question remains open.

To our knowledge, there is no previous existing analysis of the $\ell_p$-MKL setting for $p>1$; the recent works of Raskutti et al. (2010) and Koltchinskii and Yuan (2010) focus on the case $p=1$ and on the sparsity pattern of the Bayes function. A refined analysis of $\ell_p$-regularized methods in the case of combination of $M$ basis functions was laid out by Koltchinskii (2009), also taking into account the possible soft sparsity pattern of the Bayes function. Extending the ideas underlying the latter analysis into the kernel setting is likely to open interesting developments.

6.3 Analysis of the Impact of the Norm Parameter p on the Accuracy of $\ell_p$-norm MKL

As outlined in the introduction, there is empirical evidence that the performance of $\ell_p$-norm MKL crucially depends on the choice of the norm parameter $p$ (cf. Figure 1 in the introduction). The aim of this section is to relate the theoretical analysis presented here to this empirically observed phenomenon. We believe that this phenomenon can be (at least partly) explained on the basis of our excess risk bound obtained in the last section. To this end we will analyze the dependency of the excess risk bounds on the chosen norm parameter $p$. We will show that the optimal $p$ depends on the geometrical properties of the learning problem and that in general, depending on the true geometry, any $p$ can be optimal. Since our excess risk bound is only formulated for $p\le2$, we will limit the analysis to the range $[1,2]$.

To start with, first note that the choice of $p$ only affects the excess risk bound in the factor (cf. Theorem 11 and Equation (21))

$$\nu_t := \min_{t\in[p,2]}\,(t^*D)^{\frac{2}{1+\alpha}}M^{1+\frac{2}{1+\alpha}\left(\frac{1}{t^*}-1\right)}.$$

So we write the excess risk as $P(l_{\hat f}-l_{f^*})=O(\nu_t)$ and hide all variables and constants in the O-notation for the whole section (in particular the sample size $n$ is considered a constant for the purposes of the present discussion). It might surprise the reader that we consider the term in $D$ in the bound although it seems from the bound that it does not depend on $p$. This stems from a subtle reason that we have ignored in this analysis so far: $D$ is related to the approximation properties of the class, i.e., its ability to attain the Bayes hypothesis. For a "fair" analysis we should take the approximation properties of the class into account.

To illustrate this, let us assume that the Bayes hypothesis belongs to the space $\mathcal{H}$ and can be represented by $w^*$; assume further that the block components satisfy $\|w_m^*\|_2=m^{-\beta}$, $m=1,\ldots,M$, where $\beta\ge0$ is a parameter parameterizing the "soft sparsity" of the components. For example, the cases $\beta\in\{0.5,1,2\}$ are shown in Figure 2 for $M=2$ and assuming that each kernel has rank 1 (thus being isomorphic to $\mathbb{R}$). If $n$ is large, the best bias-complexity tradeoff for a fixed $p$ will correspond to a vanishing bias, so that the best choice of $D$ will be close to the minimal value such that $w^*\in H_{p,D}$, that is, $D_p=\|w^*\|_p$. Plugging in this value for $D_p$, the bound factor $\nu_p$ becomes

$$\nu_p := \|w^*\|_p^{\frac{2}{1+\alpha}}\,\min_{t\in[p,2]}\,t^{*\frac{2}{1+\alpha}}M^{1+\frac{2}{1+\alpha}\left(\frac{1}{t^*}-1\right)}.$$

[Figure 2: Two-dimensional illustration of the three analyzed learning scenarios, which differ in the soft sparsity of the Bayes hypothesis $w^*$ (parametrized by $\beta$). Panels: (a) $\beta=2$, (b) $\beta=1$, (c) $\beta=0.5$. Left: A soft sparse $w^*$. Center: An intermediate non-sparse $w^*$. Right: An almost-uniformly non-sparse $w^*$.]

We can now plot the value $\nu_p$ as a function of $p$ for special choices of $\alpha$, $M$, and $\beta$. We realized this simulation for $\alpha=2$, $M=1000$, and $\beta\in\{0.5,1,2\}$, which means we generated three learning scenarios with different levels of soft sparsity parametrized by $\beta$. The results are shown in Figure 3. Note that the soft sparsity of $w^*$ decreases from the left-hand to the right-hand side. We observe that in the "soft sparsest" scenario ($\beta=2$, shown on the left-hand side) the minimum is attained for a quite small $p=1.2$, while for the intermediate case ($\beta=1$, shown at the center) $p=1.4$ is optimal, and finally in the uniformly non-sparse scenario ($\beta=0.5$, shown on the right-hand side) the choice of $p=2$ is optimal (although even a higher $p$ could be optimal, but our bound is only valid for $p\in[1,2]$).
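
A simulation of this kind can be sketched in a few lines. The code below is not the authors' script: it assumes that the bound factor has the form $\nu_p=\|w^*\|_p^{\frac{2}{1+\alpha}}\min_{t\in[p,2]}t^{*\frac{2}{1+\alpha}}M^{1+\frac{2}{1+\alpha}(\frac{1}{t^*}-1)}$ with $\|w^*\|_p=\big(\sum_{m=1}^{M}m^{-\beta p}\big)^{1/p}$, evaluates it on a grid of $p$ values (starting slightly above 1 so that $t^*$ stays finite), and reports the minimizer. The function names and grid resolutions are arbitrary choices.

```python
import numpy as np

def bound_factor(p: float, alpha: float, M: int, beta: float, n_t: int = 400) -> float:
    """Assumed form of nu_p for block norms ||w*_m||_2 = m^(-beta)."""
    m = np.arange(1, M + 1, dtype=float)
    w_norm = np.sum(m ** (-beta * p)) ** (1.0 / p)      # ||w*||_p
    t = np.linspace(p, 2.0, n_t)                         # candidate t in [p, 2]
    t_star = t / (t - 1.0)                               # conjugate exponent (t > 1)
    e = 2.0 / (1.0 + alpha)
    factor = t_star ** e * M ** (1.0 + e * (1.0 / t_star - 1.0))
    return float(w_norm ** e * factor.min())

def best_p(alpha: float, M: int, beta: float, grid=None) -> float:
    """Grid point in [1.01, 2] at which the assumed bound factor is smallest."""
    grid = np.linspace(1.01, 2.0, 100) if grid is None else np.asarray(grid)
    values = np.array([bound_factor(p, alpha, M, beta) for p in grid])
    return float(grid[values.argmin()])

for beta in (2.0, 1.0, 0.5):
    print(f"beta={beta}: minimizing p on the grid = {best_p(2.0, 1000, beta):.2f}")
```

Plotting `bound_factor` over the grid for each `beta` gives curves of the kind described for Figure 3; the exact location of each minimum depends on the grid resolution.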

This means that if the true Bayes hypothesis has an intermediately dense representation, our bound gives the strongest generalization guarantees to $\ell_p$-norm MKL using an intermediate choice of $p$. This is also intuitive: if the truth exhibits some soft sparsity but is not strongly sparse, we expect non-sparse MKL to perform better than strongly sparse MKL or the unweighted-sum kernel SVM.

[Figure 3: Results of the simulation for the three analyzed learning scenarios. The value of the bound factor $\nu_p$ is plotted as a function of $p$ (horizontal axis, $p$ from 1.0 to 2.0). The minimum is attained depending on the true soft sparsity of the Bayes hypothesis $w^*$ (parametrized by $\beta$). Panels: (a) $\beta=2$, (b) $\beta=1$, (c) $\beta=0.5$. Left: An "almost sparse" $w^*$.]

7. Conclusion

We derived a sharp upper bound on the local Rademacher complexity of $\ell_p$-norm multiple kernel learning under the assumption of uncorrelated kernels. We also proved a lower bound that matches the upper one and shows that our result is tight. Using the local Rademacher complexity bound, we derived an excess risk bound that attains the fast rate of $O\big(n^{-\frac{\alpha}{1+\alpha}}\big)$, where $\alpha$ is the minimum eigenvalue decay rate of the individual kernels.

In a practical case study, we found that the optimal value of that bound depends on 
the true Bayes-optimal kernel weights. If the true weights exhibit soft sparsity but are not 
strongly sparse, then the generalization bound is minimized for an intermediate p. This is 
not only intuitive but also supports empirical studies showing that sparse MKL (p = 1) 
rarely works in practice, while some intermediate choice of p can improve performance. 

Of course, this connection is only valid if the optimal kernel weights are likely to be 
non-sparse in practice. Indeed, related research points in that direction. For example, 
already weak connectivity in a causal graphical model may be sufficient for all variables to 
be required for optimal predictions, and even the prevalence of sparsity in causal flows is 
being questioned (e.g., for the social sciences Gelman, 2010, argues that "There are (almost) 
no true zeros"). 

Finally, we note that there seems to be a certain preference for sparse models in the scientific community. However, sparsity by itself should not be considered the ultimate virtue to be strived for; on the contrary, previous MKL research has shown that non-sparse models may improve quite impressively over sparse ones in practical applications. The present analysis supports this by showing that the reason for this might be traced back to non-sparse MKL attaining better generalization bounds in non-sparse learning scenarios. We remark that this point of view is also supported by related analyses.

For example, it was shown by Leeb and Pötscher (2008) in a fixed design setup that any sparse estimator (i.e., satisfying the oracle property of correctly predicting the zero values of the true target $w^*$) has a maximal scaled mean squared error (MSMSE) that diverges to $\infty$. This is somewhat suboptimal since, for example, least-squares regression has a converging MSMSE. Although this is an asymptotic result, it might also be one of the reasons for finding excellent (nonasymptotic) results in non-sparse MKL. In another recent study, Xu et al. (2008) showed that no sparse algorithm can be algorithmically stable. This is notable because algorithmic stability is connected with generalization error (Bousquet and Elisseeff, 2002).
Appendix A. Lemmata and Proofs

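The following result gives a block-structured version of Hölder's inequality (e.g., Steele, 2004).

Lemma 12 (Block-structured Hölder inequality). Let $x=(x^{(1)},\ldots,x^{(M)})$, $y=(y^{(1)},\ldots,y^{(M)})\in\mathcal{H}=\mathcal{H}_1\times\cdots\times\mathcal{H}_M$. Then, for any $p\ge1$, it holds

$$\langle x,y\rangle \le \|x\|_{2,p}\,\|y\|_{2,p^*}.$$

Proof By the Cauchy-Schwarz inequality (C.-S.), we have for all $x,y\in\mathcal{H}$:

$$\langle x,y\rangle = \sum_{m=1}^{M}\big\langle x^{(m)},y^{(m)}\big\rangle \overset{\text{C.-S.}}{\le} \sum_{m=1}^{M}\|x^{(m)}\|_2\,\|y^{(m)}\|_2 = \Big\langle\big(\|x^{(1)}\|_2,\ldots,\|x^{(M)}\|_2\big),\big(\|y^{(1)}\|_2,\ldots,\|y^{(M)}\|_2\big)\Big\rangle \overset{\text{Hölder}}{\le} \|x\|_{2,p}\,\|y\|_{2,p^*}.$$

■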
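
As an illustration of the block-structured Hölder inequality, the sketch below draws random block vectors and checks $\langle x,y\rangle\le\|x\|_{2,p}\|y\|_{2,p^*}$ numerically. The block dimensions, the random seed and the helper name `block_norm` are arbitrary choices for the example, and finite-dimensional blocks stand in for the Hilbert spaces $\mathcal{H}_m$.

```python
import numpy as np

def block_norm(blocks, p: float) -> float:
    """||x||_{2,p}: the l_p norm of the vector of Euclidean norms of the blocks."""
    return float(np.linalg.norm([np.linalg.norm(b) for b in blocks], ord=p))

rng = np.random.default_rng(0)
dims = [3, 5, 2, 7]                     # hypothetical block dimensions
for p in (1.0, 1.5, 2.0, 4.0):
    p_star = np.inf if p == 1.0 else p / (p - 1.0)
    x = [rng.standard_normal(d) for d in dims]
    y = [rng.standard_normal(d) for d in dims]
    inner = sum(float(xb @ yb) for xb, yb in zip(x, y))
    # Small tolerance guards against floating-point rounding only.
    assert inner <= block_norm(x, p) * block_norm(y, p_star) + 1e-12
```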



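Proof of Lemma 3 (Rosenthal + Young) It is clear that the result trivially holds for $\frac{1}{2}\le q\le1$ with $C_q=1$ by Jensen's inequality. In the case $q\ge1$, we apply Rosenthal's inequality (Rosenthal, 1970) to the sequence $X_1,\ldots,X_n$, thereby using the optimal constants computed in Ibragimov and Sharakhmetov (2001), that is, $C_q=2$ ($q\le2$) and $C_q=E Z^q$ ($q\ge2$), respectively, where $Z$ is a random variable distributed according to a Poisson law with parameter $\lambda=1$. This yields

$$E\bigg(\frac{1}{n}\sum_{i=1}^{n}X_i\bigg)^q \le C_q\max\Bigg(\frac{1}{n^q}\sum_{i=1}^{n}E X_i^q,\ \bigg(\frac{1}{n}\sum_{i=1}^{n}E X_i\bigg)^q\Bigg). \qquad (25)$$

By using that $X_i\le B$ holds almost surely, we could readily obtain a bound of the form $\frac{B^q}{n^{q-1}}$ on the first term. However, this is loose and for $q=1$ does not converge to zero when $n\to\infty$. Therefore, we follow a different approach based on Young's inequality (e.g. Steele, 2004):

$$\frac{1}{n^q}\sum_{i=1}^{n}E X_i^q \le \bigg(\frac{B}{n}\bigg)^{q-1}\frac{1}{n}\sum_{i=1}^{n}E X_i \overset{\text{Young}}{\le} \frac{1}{q^*}\bigg(\frac{B}{n}\bigg)^{q^*(q-1)} + \frac{1}{q}\bigg(\frac{1}{n}\sum_{i=1}^{n}E X_i\bigg)^q = \frac{1}{q^*}\bigg(\frac{B}{n}\bigg)^{q} + \frac{1}{q}\bigg(\frac{1}{n}\sum_{i=1}^{n}E X_i\bigg)^q.$$

It thus follows from (25) that for all $q\ge\frac{1}{2}$

$$E\bigg(\frac{1}{n}\sum_{i=1}^{n}X_i\bigg)^q \le C_q\Bigg(\bigg(\frac{B}{n}\bigg)^q + \bigg(\frac{1}{n}\sum_{i=1}^{n}E X_i\bigg)^q\Bigg),$$

where $C_q$ can be taken as 2 ($q\le2$) and $E Z^q$ ($q\ge2$), respectively, where $Z$ is Poisson-distributed. In the subsequent Lemma 13 we show $E Z^q\le(q+e)^q$. Clearly, for $q\ge\frac{1}{2}$ it holds $q+e\le qe+eq=2eq$ so that in any case $C_q\le\max\big(2,(2eq)^q\big)\le(2eq)^q$, which concludes the result. ■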

The following lemma gives a handle on the $q$-th moment of a Poisson-distributed random variable and is used in the previous proof.

Lemma 13. For the $q$-moment of a random variable $Z$ distributed according to a Poisson law with parameter $\lambda=1$, the following inequality holds for all $q\ge1$:

$$E Z^q := \frac{1}{e}\sum_{k=0}^{\infty}\frac{k^q}{k!} \le (q+e)^q.$$

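Proof We start by decomposing $E Z^q$ as follows:

$$E Z^q = \frac{1}{e}\Bigg(0+\sum_{k=1}^{q}\frac{k^q}{k!}+\sum_{k=q+1}^{\infty}\frac{k^q}{k!}\Bigg) \le \frac{1}{e}\Bigg(q^q+\sum_{k=q+1}^{\infty}\frac{k^q}{k!}\Bigg). \qquad (26)$$

Note that by Stirling's approximation it holds $k!=\sqrt{2\pi k}\,e^{\tau_k}\big(\frac{k}{e}\big)^k$ with $\frac{1}{12k+1}<\tau_k<\frac{1}{12k}$ for all $k$. Thus

$$\sum_{k=q+1}^{\infty}\frac{k^q}{k!} = \sum_{k=1}^{\infty}\frac{(k+q)^q}{(k+q)!} = e^q\sum_{k=1}^{\infty}\frac{1}{\sqrt{2\pi(k+q)}\,e^{\tau_{k+q}}}\bigg(\frac{e}{k+q}\bigg)^{k} \overset{(*)}{\le} e^q\sum_{k=1}^{\infty}\frac{1}{\sqrt{2\pi k}\,e^{\tau_k}}\bigg(\frac{e}{k}\bigg)^{k} \overset{\text{Stirling}}{=} e^q\sum_{k=1}^{\infty}\frac{1}{k!} \le e^{q+1},$$

where for (*) note that $e^{\tau_k}\sqrt{k}\le e^{\tau_{k+q}}\sqrt{k+q}$ can be shown by some algebra using the above bounds on $\tau_k$. Now by (26)

$$E Z^q \le \frac{1}{e}\big(q^q+e^{q+1}\big) \le q^q+e^q \le (q+e)^q,$$

which was to show. ■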
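
The moment bound $E Z^q\le(q+e)^q$ can be checked numerically for a Poisson variable with parameter 1. This sketch truncates the series $\frac{1}{e}\sum_k k^q/k!$ and evaluates each term in the log domain via `math.lgamma`, which avoids overflow from large factorials. The function name and the truncation length are assumptions made for the example.

```python
import math

def poisson_moment(q: float, terms: int = 200) -> float:
    """q-th moment of Z ~ Poisson(1): (1/e) * sum_{k>=1} k^q / k!, truncated.

    The k = 0 term vanishes for q > 0; terms are computed as exp(q log k - log k!).
    """
    total = sum(math.exp(q * math.log(k) - math.lgamma(k + 1)) for k in range(1, terms))
    return total / math.e

for q in (1, 2, 3.5, 6, 10):
    moment = poisson_moment(q)
    assert moment <= (q + math.e) ** q
    print(f"q={q}: E Z^q ~ {moment:.4f}, (q+e)^q = {(q + math.e) ** q:.4f}")
```

For integer $q$ the moments coincide with the Bell numbers (1, 2, 5, ...), which gives an independent check on the truncated series.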

Lemma 14. For any $a,b\in\mathbb{R}_+^m$ it holds for all $q\ge1$

$$\|a\|_q+\|b\|_q \le 2^{1-\frac{1}{q}}\,\|a+b\|_q \le 2\,\|a+b\|_q.$$

Proof Let $a=(a_1,\ldots,a_m)$ and $b=(b_1,\ldots,b_m)$. Because all components of $a,b$ are nonnegative, we have

$$\forall i=1,\ldots,m:\quad a_i^q+b_i^q \le (a_i+b_i)^q$$

and thus

$$\|a\|_q^q+\|b\|_q^q \le \|a+b\|_q^q. \qquad (28)$$

We conclude by $\ell_q$-to-$\ell_1$ conversion (see (14))

$$\|a\|_q+\|b\|_q = \big\|\big(\|a\|_q,\|b\|_q\big)\big\|_1 \overset{(14)}{\le} 2^{1-\frac{1}{q}}\big\|\big(\|a\|_q,\|b\|_q\big)\big\|_q = 2^{1-\frac{1}{q}}\big(\|a\|_q^q+\|b\|_q^q\big)^{\frac{1}{q}} \overset{(28)}{\le} 2^{1-\frac{1}{q}}\,\|a+b\|_q,$$

which completes the proof. ■
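
A quick numerical check of Lemma 14 on random nonnegative vectors is sketched below. The helper name `lemma14_sides`, the vector length and the seed are illustrative assumptions; the three returned values are the left-hand side, the middle term and the right-hand side of the chain of inequalities.

```python
import numpy as np

def lemma14_sides(a, b, q: float):
    """Return (||a||_q + ||b||_q, 2^(1-1/q) ||a+b||_q, 2 ||a+b||_q) for nonnegative a, b."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    lhs = np.linalg.norm(a, q) + np.linalg.norm(b, q)
    mid = 2.0 ** (1.0 - 1.0 / q) * np.linalg.norm(a + b, q)
    rhs = 2.0 * np.linalg.norm(a + b, q)
    return float(lhs), float(mid), float(rhs)

rng = np.random.default_rng(1)
for q in (1.0, 1.5, 2.0, 3.0, 10.0):
    a, b = rng.random(6), rng.random(6)      # entries in [0, 1), hence nonnegative
    lhs, mid, rhs = lemma14_sides(a, b, q)
    assert lhs <= mid + 1e-12 and mid <= rhs + 1e-12
```

Taking $a=(1,0)$, $b=(0,1)$ and $q=2$ gives equality in the first inequality, so the constant $2^{1-\frac{1}{q}}$ cannot be improved in general.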